Abstract

To address the low efficiency of multi-topic crawling, we investigate the differences between built-in search engines (BSEs) and general search engines (GSEs). We propose dividing topic rules into atomic rules and analyze three relations between atomic rules: the equating relation, the exchanging relation, and the containing relation. Based on these relations, we design separate allocation strategies for BSEs and GSEs, which both improve the precision of topic-specific crawling and reduce the number of crawls. Furthermore, we propose a sentence cluster-based method for computing the relevance between topics and documents, which addresses the low precision that results from crawling information directly by atomic rules. We conduct an experiment with 138 topic rules (containing 8223 atomic rules), 14 BSEs, and 4 GSEs, measuring the amount of information crawled and the amount of relevant information obtained per unit time. The results show that the proposed method performs more effectively.

Full Text