Abstract

The language modeling approach to information retrieval has recently attracted much attention. In language modeling retrieval models, documents are scored and ranked by the query likelihood method. From a theoretical perspective, however, justifying the existing (standard) query likelihood method under the probability ranking principle requires an unrealistic assumption about the generation of a "negative query" from a document: that the probability that a user who dislikes a document would use a query does not depend on the particular document. This assumption makes it possible to ignore negative query generation and so justify using the basic query likelihood method as a retrieval function. In reality, however, the assumption does not hold, because a user who dislikes a document would more likely avoid using words in the document when posing a query. This suggests that the standard query likelihood function is a potentially non-optimal retrieval function. In this paper, we attempt to improve the standard language modeling retrieval models by bringing back the component of negative query generation. Specifically, we propose a general and efficient approach to estimating document-dependent probabilities of negative query generation based on the principle of maximum entropy, and derive a more complete query likelihood retrieval function that also contains the negative query generation component. In addition, we develop a more general probabilistic distance retrieval method that naturally incorporates query language models and covers the proposed query likelihood with negative query generation as a special case. The proposed approaches not only bridge the theoretical gap between the standard language modeling retrieval models and the notion of relevance, but also improve retrieval effectiveness with (almost) no additional computational cost.

  • Publication date: 2015-8
  • Affiliation: Microsoft