摘要

Reservoir sampling is a known technique for maintaining a random sample of a fixed size over a data stream of an unknown size. While reservoir sampling is suitable for applications demanding a sample over the whole data stream, it is not designed for applications in which an input stream is composed of sub-streams with heterogeneous statistical properties. For this class of applications, the conventional reservoir sampling technique can lead to a potential damage in the statistical quality of the sample because it does not guarantee the inclusion of a statistically sufficient number of tuples in the sample from each sub-stream. In this paper, we address this heterogeneity problem by stratifying the reservoir sample among the underlying sub-streams. This stratification poses two challenges. First, a fixed-size reservoir should be allocated to individual sub-streams optimally, specifically to have the stratified reservoir sample used to generate estimates at the level of either the whole data stream or the individual sub-streams. Second, the allocation should be adjusted adaptively if and when new sub-streams appear in or existing sub-streams disappear from the input stream and as their statistical properties change. We propose a novel adaptive stratified reservoir sampling algorithm designed to meet these challenges. An extensive performance study shows the superiority of the achieved sample quality and demonstrates the adaptivity of the proposed sampling algorithm.

  • 出版日期2014-1