A novel incremental parallel web crawler based on focused crawling

Huang Qiuyan; Li Qingzhong<sup>*</sup>; Yan Zhongmin; Fu Hong

Abstract

With the tremendous growth of the Web, it has become a major challenge for general-purpose, single-process crawlers to locate precise and relevant resources in a reasonable amount of time, so more capable algorithms are in demand. In this paper, a novel incremental parallel Web crawler based on focused crawling is proposed, which concurrently crawls the Web pages relevant to multiple pre-defined topics. To solve the problem of URL distribution, a compound decision model based on a multi-objective decision-making method is introduced, which jointly considers multiple factors such as load balance and relevance. To decide how often the local repository should be updated, an update frequency graph model is presented, in which the graph is constructed dynamically according to the update frequency of Web pages. Extensive experiments show that the proposed system efficiently acquires Web information of high quality, high relevance, and high freshness.
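The abstract does not give the details of the compound decision model, so the following is only an illustrative sketch of how a URL might be assigned to one of several topic-specific crawlers by trading off relevance against load balance. It assumes a plain weighted-sum score, which is a stand-in technique and not necessarily the authors' method. The `Crawler` class, the Jaccard-based `relevance` measure, the `capacity` field, and the weights `w_relevance` and `w_load` are all hypothetical names and choices introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Crawler:
    """Hypothetical crawler process dedicated to one pre-defined topic."""
    name: str
    topic_terms: frozenset
    queue_length: int  # number of URLs already waiting (assumed load measure)
    capacity: int      # assumed maximum queue size


def relevance(url_terms, crawler):
    """Jaccard overlap between the URL's context terms and the crawler's topic."""
    union = url_terms | crawler.topic_terms
    if not union:
        return 0.0
    return len(url_terms & crawler.topic_terms) / len(union)


def load_headroom(crawler):
    """Fraction of the crawler's queue that is still free (1.0 means idle)."""
    return max(0.0, 1.0 - crawler.queue_length / crawler.capacity)


def assign_url(url_terms, crawlers, w_relevance=0.7, w_load=0.3):
    """Return the crawler with the highest weighted-sum score (assumed model)."""
    def score(c):
        return w_relevance * relevance(url_terms, c) + w_load * load_headroom(c)
    return max(crawlers, key=score)


crawlers = [
    Crawler("sports", frozenset({"football", "league", "match"}), 90, 100),
    Crawler("finance", frozenset({"stock", "market", "bank"}), 10, 100),
]
url_terms = frozenset({"football", "match", "score"})

# With relevance weighted heavily, the busy but on-topic crawler wins.
chosen = assign_url(url_terms, crawlers)
print(chosen.name)
```

Shifting the weights toward load (for example `w_relevance=0.1, w_load=0.9`) sends the same URL to the idle crawler instead, which shows the kind of trade-off a multi-objective model has to resolve.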

Publication date: 2013


Record updated: 2018-08-03 13:00
