Abstract

Relevancy Context Graph (RCG) assigns an order to unvisited web pages by building a context graph and two language models: a general language model and a topic language model. However, RCG ignores much of the structural information that links the web pages in the context graph. In this paper, we optimize RCG by taking into account more information about the link relationships among web pages and by applying link prediction, a technique from social network analysis, to improve the performance of topic-specific crawlers. Moreover, we compute the semantic similarity of each pair of web pages in a parent-child relationship and remove poor-quality web pages, so that the effectiveness of RCG is preserved. We compare the optimized method experimentally against the original algorithm, and the results indicate that our method outperforms the non-optimized RCG.

Full text