An efficient method to evaluate intersections on big data sets

Chen Yangjun<sup>*</sup>; Shen Weixin

doi:10.1016/j.tcs.2016.07.018

摘要

Set intersections are important in computer science. Especially, intersection of inverted lists is a fundamental operation in information retrieval for text databases and Web search engines. In this paper, we discuss an efficient and.effective way to implement this operation in the context of very big data sets. The main idea behind it is to do a binary search over sorted interval sequences, each of which corresponds to an inverted list and is constructed by establishing a trie over the sequences of set identifiers as well as a kind of tree encoding, by which each node in the trie is assigned an interval. In many cases, an interval sequence is much shorter than its corresponding inverted list. In particular, the lowest common ancestors of intervals in a trie can be utilized to control a binary search to skip over useless interval containment checks, which enables us to reach an optimal off-line algorithm to do the task, and is theoretically better than any traditional on-line methods (at cost of more space). Experiments have been conducted, showing that the trade-off of space for time is worthwhile.

出版日期2016-9-27

全文

访问全文

收藏分享被引(3) 浏览

更新时间：2021-03-24 20:52

An efficient method to evaluate intersections on big data sets

摘要

全文

产品服务

站内浏览

服务支持

联系方式

科研之友