A Template Independent Approach for Web News and Blog Content Extraction

Ma Xueyang<sup>*</sup>; Zhang Hongli; Yu Xiangzhan; Li Yingjun

doi:10.1109/ICISCE.2016.36

Abstract

The Web has become a large platform for publishing and consuming information. Web news and blogs are representative information sources that offer convenient ways to keep informed. In addition to the main content, most web pages also contain navigation panels, advertisements, recommended articles, and similar elements. Extracting news and blog content effectively while filtering out this noise is both necessary and challenging. In this paper we propose a news and blog content extraction approach that is portable across languages and domains. Our extensive case studies show that characters that are not anchor text but occur in text containing stop words are more likely to be genuine content. Our method first traverses the entire DOM tree and counts these valid characters attached to each DOM node. It then recursively steps into the most representative child node, chosen by its valid-character count, and finally stops at the main content node according to a predefined criterion. To validate the approach, we conduct experiments on online news and blog pages randomly selected from well-known Chinese and English websites. The experimental results show that our method achieves an F1-measure of 96% on average and outperforms CETR.
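The three steps described in the abstract (count valid characters per DOM node, descend into the child with the most valid characters, stop at a criterion) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the stop-word list, the `ratio` threshold used as the stopping rule, and all class and function names (`Node`, `TreeBuilder`, `extract_main_node`) are assumptions made for illustration, since the abstract does not specify them. The sketch also tokenizes only English words, whereas the paper targets Chinese as well.

```python
import re
from html.parser import HTMLParser

# Assumed toy English stop-word list; a real system would load one per language.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that", "for"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag
SKIP_TAGS = {"script", "style"}                           # never reader-visible text


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs or []), parent
        self.children = []
        self.own_valid = 0  # valid characters directly inside this node
        self.valid = 0      # valid characters in the whole subtree


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM (assumes reasonably well-formed HTML)."""

    def __init__(self):
        super().__init__()
        self.root = self.current = Node("#root")
        self.anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == self.current.tag and self.current.parent is not None:
            if tag == "a":
                self.anchor_depth -= 1
            self.current = self.current.parent

    def handle_data(self, data):
        text = data.strip()
        if not text or self.anchor_depth or self.current.tag in SKIP_TAGS:
            return
        # A text run counts as "valid" if it is outside anchors and has a stop word.
        if STOP_WORDS & set(re.findall(r"[a-z']+", text.lower())):
            self.current.own_valid += len(text)


def total_valid(node):
    """Step 1: aggregate valid-character counts bottom-up over the tree."""
    node.valid = node.own_valid + sum(total_valid(c) for c in node.children)
    return node.valid


def find_content_node(root, ratio=0.7):
    """Steps 2 and 3: follow the richest child until the assumed criterion fires."""
    node = root
    while node.children and node.valid > 0:
        best = max(node.children, key=lambda c: c.valid)
        # Assumed stopping rule: stop when no single child dominates the node.
        if best.valid < ratio * node.valid:
            break
        node = best
    return node


def extract_main_node(html, ratio=0.7):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    total_valid(builder.root)
    return find_content_node(builder.root, ratio)
```

Calling `extract_main_node(page_html)` returns the `Node` at which the descent stops; its `tag` and `attrs` identify the candidate content container. The `ratio` value of 0.7 is an arbitrary placeholder for the paper's predefined criterion and would need tuning against real pages.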

Publication date: 2016
Institution: Harbin Institute of Technology
