摘要

By combining the binomial distribution with the small probability principle, a dynamic double-threshold filter for Minwise Hash was proposed. The filter divides compared progress into several compared point and sets dynamic threshold on them to filter documents with similarity under lower threshold TL(k) and output documents with similarity exceed upper threshold TU(k). It decreases the comparing time and workload that follow. The experimental results demonstrate that the dynamic double-threshold filter requires only 31% of CPU running time of Minwise Hash or 36% of b-bit Minwise Hash. Besides, the performance of the filter is enhanced with a set of suitable parameters. The filter applies not only to Minwise Hash, but also to its mutation algorithm (e.g. b-bit Minwise Hash), and even all estimators conforming to binomial distribution.

全文