A Template Independent Approach for Web News and Blog Content Extraction

Authors: Ma Xueyang*; Zhang Hongli; Yu Xiangzhan; Li Yingjun
Source: 3rd International Conference on Information Science and Control Engineering (ICISCE), 2016-07-08 to 2016-07-10.
DOI: 10.1109/ICISCE.2016.36

Abstract

The Web has become a large platform for publishing and consuming information. Web news and blogs are both representative information sources that provide convenient ways to keep informed. In addition to the main content, most web pages also contain navigation panels, advertisements, recommended articles, etc. Effectively extracting news and blog content while filtering out this noise is both necessary and challenging. In this paper we propose a news and blog content extraction approach that is portable across languages and domains. Our extensive case studies show that characters which are not anchor text but belong to text containing stop words are more likely to be genuine content. Our method first traverses the entire DOM tree and counts the valid characters attached to each DOM node. It then recursively steps into the most representative child node, as measured by valid characters, and finally stops at the main content node according to a predefined criterion. To validate the approach, we conduct experiments on online news and blog pages randomly selected from well-known Chinese and English websites. Experimental results show that our method achieves a 96% F1-measure on average and outperforms CETR.
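
The procedure outlined in the abstract (count valid characters per DOM node, then descend into the dominant child until a stopping criterion is met) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not give the stop-word list, the exact definition of "most representative", or the stopping criterion, so `STOP_WORDS`, the `ratio` threshold, and all function and class names below are assumptions made for illustration. Whitespace tokenization is also an assumption that only suits English; Chinese text would need a word segmenter.

```python
from html.parser import HTMLParser

# Assumed, deliberately tiny stop-word list; the paper's list is not given.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """Hypothetical minimal DOM node."""
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.texts = []   # text chunks directly under this node
        self.valid = 0    # valid characters in this node's subtree


class TreeBuilder(HTMLParser):
    """Builds a simple tree of Node objects from HTML."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.texts.append(data.strip())


def count_valid(node, in_anchor=False):
    """Count characters of non-anchor text chunks that contain a stop word."""
    in_anchor = in_anchor or node.tag == "a"
    total = 0
    if not in_anchor:
        for text in node.texts:
            words = {w.strip(".,!?;:").lower() for w in text.split()}
            if words & STOP_WORDS:
                total += len(text)
    for child in node.children:
        total += count_valid(child, in_anchor)
    node.valid = total
    return total


def find_content_node(root, ratio=0.6):
    """Descend into the child with the most valid characters.

    Assumed stopping criterion: stop when no single child holds at
    least `ratio` of the current node's valid characters.
    """
    node = root
    while node.children:
        best = max(node.children, key=lambda c: c.valid)
        if node.valid == 0 or best.valid < ratio * node.valid:
            break
        node = best
    return node


def extract_main_node(html, ratio=0.6):
    """Parse HTML, score every node, and return the chosen content node."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    count_valid(builder.root)
    return find_content_node(builder.root, ratio)
```

Calling `extract_main_node(page_html)` returns the node at which the descent stops. With the assumed threshold, the walk halts at a container whose valid characters are spread over several paragraphs rather than continuing into a single paragraph.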