Abstract

Over the last decade, skyline query processing has become increasingly important because of its usefulness in decision-making applications. Since the datasets used for skyline query processing are huge, MapReduce-based skyline query processing algorithms have been widely studied. However, existing algorithms suffer from low filtering efficiency in local skyline computation, and they unrealistically assume both uniform data distributions and dimensional independence. In this paper, we propose a parallel skyline query processing algorithm for MapReduce that uses multiple regression analysis. The goal of our algorithm is to efficiently find the skyline points of a large dataset by reducing the number of candidates before the skyline computation. To support skyline computation on anti-correlated datasets, we computed a data filtering threshold line based on a multiple regression analysis of the sampled dataset. To guarantee the accuracy of the skyline result, we considered both the filtering threshold line and a grid-based cell dominance condition, so that only relevant data are processed in the actual skyline computation step. For the local skyline computation, we utilized an angle-based partitioning of the data space that effectively eliminates non-promising points within partitions. For the global skyline computation, we used the dominance relationship among grid-based partitions to prune unnecessary skyline points. Performance analyses showed that our parallel skyline query processing algorithm outperformed existing algorithms under various settings.
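The local and global stages described above can be illustrated with a minimal single-process sketch. It is not the paper's implementation: it assumes two-dimensional points with non-negative coordinates where smaller values are preferred, uses a plain block-nested-loop skyline, and omits both the regression-based threshold line and the grid-based cell dominance check. All function names (`dominates`, `skyline`, `angle_partition`, `two_stage_skyline`) are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q if p is no worse in every dimension and strictly
    better in at least one (assumption: smaller values are better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points: List[Point]) -> List[Point]:
    """Block-nested-loop skyline: keep only points no other point dominates."""
    result: List[Point] = []
    for p in points:
        if any(dominates(q, p) for q in result):
            continue  # p is dominated by a current candidate
        # Drop candidates that p dominates, then keep p.
        result = [q for q in result if not dominates(p, q)]
        result.append(p)
    return result


def angle_partition(p: Point, num_partitions: int) -> int:
    """Assign a 2-D point to a partition by its angle from the x-axis.
    Assumes non-negative coordinates, so the angle lies in [0, pi/2]."""
    angle = math.atan2(p[1], p[0])
    index = int(angle / (math.pi / 2) * num_partitions)
    return min(index, num_partitions - 1)  # angle == pi/2 goes to the last partition


def two_stage_skyline(points: List[Point], num_partitions: int = 4) -> List[Point]:
    """Local skylines per angular partition (the map side), then a global
    skyline over the merged local results (the reduce side)."""
    partitions: List[List[Point]] = [[] for _ in range(num_partitions)]
    for p in points:
        partitions[angle_partition(p, num_partitions)].append(p)
    local_results = [s for part in partitions for s in skyline(part)]
    return skyline(local_results)


if __name__ == "__main__":
    data = [(1, 9), (2, 7), (3, 8), (4, 4), (6, 5), (7, 2), (9, 1), (8, 8)]
    print(sorted(two_stage_skyline(data)))
```

Because each angular partition spans from good to poor values along a ray from the origin, most points in a partition are dominated by a few local points, so each local skyline stays small before the global merge.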

  • Publication date: 2017-12