Abstract

Focused crawlers compute the priorities of unvisited hyperlinks as a linear combination of their full-text similarities and anchor-text similarities. However, the combination factors are usually set manually to 0.5, which makes them subjective and not grounded in data. To address this problem, this paper proposes an intelligent focused crawler based on a genetic algorithm (GAFC). First, the GAFC applies selection, crossover, and mutation rules to find the optimal combination factors, that is, the factors that minimize the error between predicted and actual topical similarities. Second, the GAFC computes the topical similarities of full texts and anchor texts using the vector space model (VSM). Finally, the GAFC combines the optimal combination factors with these topical similarities to predict the topical similarities of unvisited hyperlinks. Experiments indicate that using the optimal combination factors improves the performance of focused crawlers. In conclusion, the proposed method is effective for focused crawlers.
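
The pipeline described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the chromosome encoding, the fitness function, or the VSM term weighting, so the code below assumes a single real-valued weight w (with the anchor-text weight fixed at 1 - w), mean squared error as the fitness, raw term frequencies for the VSM, and hypothetical function names and GA parameters throughout.

```python
import math
import random
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity of raw term-frequency vectors (a basic VSM variant, assumed here)."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def predict(w, sim_full, sim_anchor):
    """Linear combination of full-text and anchor-text similarity."""
    return w * sim_full + (1.0 - w) * sim_anchor


def mse(w, samples):
    """Mean squared error over (sim_full, sim_anchor, actual_similarity) triples."""
    return sum((predict(w, f, a) - y) ** 2 for f, a, y in samples) / len(samples)


def fit_weight(samples, pop_size=30, generations=60, mutation_rate=0.2, seed=0):
    """Search for w in [0, 1] with a small genetic algorithm (all parameters are assumptions)."""
    rng = random.Random(seed)
    population = [rng.random() for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the better half of the population unchanged.
        population.sort(key=lambda w: mse(w, samples))
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            # Crossover: random convex combination of two parents.
            alpha = rng.random()
            child = alpha * p1 + (1.0 - alpha) * p2
            # Mutation: occasional small Gaussian perturbation.
            if rng.random() < mutation_rate:
                child += rng.gauss(0.0, 0.05)
            children.append(min(1.0, max(0.0, child)))
        population = parents + children
    return min(population, key=lambda w: mse(w, samples))
```

Given training triples collected from already-visited pages, `fit_weight(samples)` returns a weight that can replace the fixed 0.5 in `predict` when ranking unvisited hyperlinks. The similarity inputs would come from `cosine_similarity(topic_description, page_text)` and `cosine_similarity(topic_description, anchor_text)`.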

  • Publication date: 2014
