A genetic programming framework to schedule webpage updates

Authors: Santos Aecio S R; de Carvalho Cristiano R; Almeida Jussara M*; de Moura Edleno S; da Silva Altigran S; Ziviani Nivio
Source: Information Retrieval Journal, 2015, 18(1): 73-94.
DOI: 10.1007/s10791-014-9248-5

Abstract

The quality of a Web search engine is influenced by several factors, including coverage and the freshness of the content gathered by the Web crawler. Focusing particularly on freshness, one key challenge is to estimate the likelihood that a previously crawled webpage has been modified. Such estimates are used to define the order in which those pages should be visited, and can thus be exploited to reduce the cost of monitoring crawled webpages to keep their stored versions up to date. We here present a Genetic Programming framework, called GP4C (Genetic Programming for Crawling), to generate score functions that produce accurate rankings of pages according to their probabilities of having been modified. We compare GP4C with state-of-the-art methods using a large dataset of webpages crawled from the Brazilian Web. Our evaluation includes multiple performance metrics and several variations of our framework, built by exploring different sets of terminals and fitness functions. In particular, we evaluate the ChangeRate and Normalized Discounted Cumulative Gain (NDCG) metrics as both objective function and evaluation metric. We show that, in comparison with ChangeRate, NDCG is better able to evaluate the effectiveness of scheduling strategies, since it takes the ranking produced by the scheduling into account.
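The contrast the abstract draws between ChangeRate and NDCG can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions, so the code below assumes common textbook forms: ChangeRate as the fraction of the top-k scheduled pages that had actually changed, and NDCG with binary gains and a log2 position discount. The function names, page identifiers, and rankings are hypothetical and chosen only for illustration.

```python
import math


def change_rate(ranking, changed, k):
    """Assumed definition: fraction of the top-k scheduled pages that changed."""
    top = ranking[:k]
    if not top:
        return 0.0
    return sum(1 for page in top if page in changed) / len(top)


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(ranking, changed, k):
    """NDCG@k with binary gains: 1.0 if the page changed, else 0.0."""
    all_gains = [1.0 if page in changed else 0.0 for page in ranking]
    ideal = sorted(all_gains, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    if ideal_dcg == 0:
        return 0.0
    return dcg(all_gains[:k]) / ideal_dcg


# Hypothetical data: pages "a" and "b" changed since the last crawl.
changed = {"a", "b"}
ranking_good = ["a", "b", "c", "d"]  # changed pages scheduled first
ranking_bad = ["c", "d", "a", "b"]   # changed pages scheduled last
k = 4

cr_good = change_rate(ranking_good, changed, k)
cr_bad = change_rate(ranking_bad, changed, k)
ndcg_good = ndcg(ranking_good, changed, k)
ndcg_bad = ndcg(ranking_bad, changed, k)

print(cr_good, cr_bad)      # identical: the order within the top-k is ignored
print(ndcg_good, ndcg_bad)  # differ: NDCG rewards visiting changed pages earlier
```

Under these assumed definitions, both rankings receive the same ChangeRate because they contain the same changed pages within the top k, while NDCG scores the first ranking higher because it places the changed pages earlier.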

  • Publication date: 2015-2