Automated Web Page Content Extraction Method Based on Document Object Model

Li Tongyu; Ren Rui; Cai Hongming*; Jiang Lihong

doi:10.16183/j.cnki.jsjtu.2018.10.027

Abstract

Web content extraction has significant engineering and application value in information retrieval, text analysis, and the processing of network resource data. To address the difficulties that useless information on web pages and the heterogeneity of web page structures pose for content extraction, this paper proposes an automated web page content extraction method based on the Document Object Model (DOM). First, for the DOM generated from each original web page, we remove useless nodes and then compress the model to simplify subsequent processing. Next, we identify the web page content based on text density and hyperlink density. Finally, we identify noise hyperlinks based on node entropy and remove them from the content. Experimental results show that, compared with traditional web page content extraction methods, our method noticeably improves accuracy and F1 score, with only a slight decline in recall.
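The density step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the formulas, so the definitions used here are assumptions (text density as text characters per element in a subtree, hyperlink density as the share of a subtree's text that sits inside `<a>` elements), as are the names `SKIP_TAGS`, `VOID_TAGS`, `Node`, `TreeBuilder`, and `score_nodes`. The set of "useless" tags is likewise assumed, the input is assumed to be reasonably well-formed HTML, and the DOM compression and node entropy steps are left out because the abstract does not define them.

```python
from html.parser import HTMLParser

# Assumption: tags treated as "useless" nodes and dropped with their content.
SKIP_TAGS = {"script", "style", "noscript"}
# Void elements carry no text and never get a closing tag, so they are ignored.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element of a simplified DOM tree with per-subtree counters."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text_chars = 0  # characters of text (whole subtree after aggregation)
        self.link_chars = 0  # characters of text inside <a> elements
        self.tag_count = 1   # number of elements in the subtree, including this one


class TreeBuilder(HTMLParser):
    """Builds the simplified tree while skipping the assumed useless nodes."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.link_depth = 0  # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        # Only close the element if it matches the one currently open.
        if self.current.tag == tag and self.current.parent is not None:
            if tag == "a":
                self.link_depth -= 1
            self.current = self.current.parent

    def handle_data(self, data):
        if self.skip_depth:
            return
        n = len(data.strip())
        self.current.text_chars += n
        if self.link_depth:
            self.current.link_chars += n


def _aggregate(node):
    """Roll the counters of all descendants up into each ancestor."""
    for child in node.children:
        _aggregate(child)
        node.text_chars += child.text_chars
        node.link_chars += child.link_chars
        node.tag_count += child.tag_count


def score_nodes(html):
    """Return (tag, id, text_density, link_density) for every element."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    _aggregate(builder.root)
    rows, stack = [], list(builder.root.children)
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        text_density = node.text_chars / node.tag_count
        link_density = node.link_chars / node.text_chars if node.text_chars else 0.0
        rows.append((node.tag, node.attrs.get("id"), text_density, link_density))
    return rows
```

Under these assumed definitions, a navigation block made up only of links has a hyperlink density of 1.0, while a block of plain paragraphs has a hyperlink density of 0.0 and a much higher text density, which is the kind of contrast a density-based selector can threshold on.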

Publication date: 2018
Affiliation: Shanghai Jiao Tong University
