Dependency-Aware Data Locality for MapReduce

Authors: Ma, Xiaoqiang*; Fan, Xiaoyi; Liu, Jiangchuan; Li, Dan
Source: IEEE Transactions on Cloud Computing, 2018, 6(3): 667-679.
DOI: 10.1109/TCC.2015.2511765

Abstract

MapReduce effectively partitions and distributes computation workloads across a cluster of servers, facilitating today's big data processing. Given the massive data to be dispatched and the intermediate results to be collected and aggregated, there have been significant studies on data locality, which seek to co-locate computation with data so as to reduce cross-server traffic in MapReduce. These studies generally assume that the input data items have little dependency on each other, which is not necessarily true for many real-world applications, and we show strong evidence that the finishing time of MapReduce tasks can be greatly prolonged by such data dependency. In this paper, we present Dependency-Aware Locality for MapReduce (DALM) for processing real-world input data that can be highly skewed and dependent. DALM accommodates data dependency in a data-locality framework, organically synthesizing the key components of data reorganization, replication, and placement. Besides the algorithmic design within the framework, we have also closely examined the deployment challenges, particularly in public virtualized cloud environments, and have implemented DALM on Hadoop 1.2.1 with Giraph 1.0.0. Its performance has been evaluated through both simulations and real-world experiments and compared with that of state-of-the-art solutions.