A survey of Web page cleaning research

Mao Xianling<sup>*</sup>; He Jing; Yan Hongfei

摘要

The rapid development of the Internet has made a variety of Web applications and Web data, which become the major source of data for lots of research. Web page includes a variety of content, such as advertising, navigation bar, related links, text, etc. However, for different studies and applications, not all content is necessary; oppositely, the unrelated content will affect the effectiveness and efficiency of the research and applications. So Web page cleaning is a highlighted topic of information retrieval with booming search engines. Thus it is necessary to sum up the field on the page de-noise, in order to better carry out in-depth study. Firstly, this paper gives a brief introduction to the necessity of Web page cleaning and its related concepts. The authors present a classification hierarchy of the Web page cleaning methods, including the single-model based Web page cleaning methods and the multi-model based Web page cleaning methods. Then, this paper summarizes all kinds of Web page cleaning techniques and frameworks, including SST, Shingle, Pagelet, DSE, etc. Thirdly, this paper describes the experimental datasets and experimental methods used in all kinds of Web page cleaning techniques. Finally, this paper discusses the existing problems and the future directions in the Web page cleaning field.

出版日期2010
单位北京大学

全文

访问全文

收藏分享被引浏览

更新时间：2018-08-07 12:16

A survey of Web page cleaning research

摘要

全文

产品服务

站内浏览

服务支持

联系方式

科研之友