Abstract

The rapid growth of the World Wide Web poses a great challenge to existing focused crawling methods. Whether they analyze text content or link structure, traditional focused crawlers have mainly operated at page granularity. Walking randomly through a network composed of a very large number of pages, a focused crawler easily gets lost. Narrowing the crawling range from the entire Web can therefore improve both precision and efficiency. A focused crawling method based on two granularities, website and web page, is put forward. First, a community detection algorithm is used to analyze the link structure of the network formed by websites, and a group of websites on the given topic is built, which narrows the crawling range. Second, all topic relevance analysis of web pages and all link prediction are performed inside this group. Topic relevance analysis is implemented by calculating the topic similarity of the title and of the content separately. The relevance of the parent pages, the anchor text, and the URL string are all considered when predicting the topic relevance of unvisited links. The experimental results suggest that the method is effective for a given topic and that it improves precision.
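The page-level step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure or the weights, so the cosine similarity over raw term counts, the function names (`page_relevance`, `link_score`), and all weight values below are assumptions chosen for illustration. The site-level community detection step is not covered here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def cosine(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def page_relevance(topic, title, content, w_title=0.4, w_content=0.6):
    """Score a fetched page: title and content are compared to the topic
    separately, then combined. The weights are assumed, not from the paper."""
    topic_tokens = tokenize(topic)
    return (w_title * cosine(topic_tokens, tokenize(title))
            + w_content * cosine(topic_tokens, tokenize(content)))


def link_score(topic, parent_relevance, anchor_text, url,
               w_parent=0.4, w_anchor=0.4, w_url=0.2):
    """Predict the relevance of an unvisited link from its parent page's
    relevance, its anchor text, and its URL string (assumed weights)."""
    topic_tokens = tokenize(topic)
    return (w_parent * parent_relevance
            + w_anchor * cosine(topic_tokens, tokenize(anchor_text))
            + w_url * cosine(topic_tokens, tokenize(url)))
```

In a crawler built along these lines, `link_score` would order the frontier of unvisited links, and `page_relevance` would be computed once a page is fetched and then passed as `parent_relevance` for that page's outgoing links. The weights would need to be tuned on labeled data.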

  • Publication date: 2015
