Abstract

In the era of Big Data, huge amounts of structured and unstructured data are produced daily by a myriad of ubiquitous sources. Big Data is difficult to work with and requires massively parallel software running on a large number of computers. MapReduce is a recent programming model that simplifies writing distributed applications that handle Big Data. For MapReduce to work, it has to divide the workload among the computers in a network. Consequently, the performance of MapReduce strongly depends on how evenly it distributes this workload, which can be a challenge, especially in the presence of data skew. In MapReduce, workload distribution depends on the algorithm that partitions the data. One way to avoid the problems caused by data skew is data sampling: how evenly the partitioner distributes the data depends on how large and representative the sample is, and on how well the sample is analyzed by the partitioning mechanism. This paper proposes an improved partitioning algorithm, built on an improved sampling algorithm and partitioner, that achieves better load balancing and lower memory consumption. To evaluate the proposed algorithm, its performance was compared against the state-of-the-art partitioning mechanism employed by TeraSort. Experiments show that the proposed algorithm is faster, more memory efficient, and more accurate than the current implementation.
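To make the sampling-based partitioning idea concrete, the sketch below shows a minimal, self-contained range partitioner in the general spirit of TeraSort: it draws a random sample of the input keys, picks evenly spaced sample quantiles as split points, and routes each key to a reducer by binary search. All names here (SampledRangePartitioner, getPartition, the toy skewed key distribution) are illustrative assumptions for this example only; this is neither the paper's proposed algorithm nor Hadoop's actual Partitioner API.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sampling-based range partitioner (not the paper's algorithm).
public class SampledRangePartitioner {

    private final long[] splitPoints; // boundaries between reducer key ranges

    // Build split points from a random sample of the input keys.
    public SampledRangePartitioner(long[] keys, int numReducers, int sampleSize, long seed) {
        Random rng = new Random(seed);
        long[] sample = new long[Math.min(sampleSize, keys.length)];
        for (int i = 0; i < sample.length; i++) {
            sample[i] = keys[rng.nextInt(keys.length)];
        }
        Arrays.sort(sample);

        // Use numReducers - 1 evenly spaced sample quantiles as boundaries.
        splitPoints = new long[numReducers - 1];
        for (int i = 0; i < splitPoints.length; i++) {
            splitPoints[i] = sample[(i + 1) * sample.length / numReducers];
        }
    }

    // Assign a key to a reducer by binary search over the split points.
    public int getPartition(long key) {
        int idx = Arrays.binarySearch(splitPoints, key);
        return idx >= 0 ? idx : -idx - 1; // insertion point when not found
    }

    public static void main(String[] args) {
        // Skewed toy data: most keys cluster near zero.
        Random rng = new Random(1);
        long[] keys = new long[100_000];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = (long) Math.abs(rng.nextGaussian() * 1_000);
        }

        SampledRangePartitioner p = new SampledRangePartitioner(keys, 4, 1_000, 42);
        int[] counts = new int[4];
        for (long k : keys) {
            counts[p.getPartition(k)]++;
        }
        // With a representative sample, the counts should be roughly balanced
        // even though the key distribution itself is skewed.
        System.out.println("keys per reducer: " + Arrays.toString(counts));
    }
}
```

How well such a scheme balances the load hinges on exactly the two factors the abstract names: the size and representativeness of the sample, and how the split points are derived from it.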