A distributed data management system to support large-scale data analysis

Emara, Tamer Z.; Huang, Joshua Zhexue<sup>*</sup>

doi:10.1016/j.jss.2018.11.007

Abstract

Distributed data management is a key technology for efficient processing and analysis of massive data in cluster-computing environments. In particular, where data volumes exceed system capabilities, big data files must be summarized by representative samples that have the same statistical properties as the whole dataset. This paper proposes a big data management system (BDMS) based on distributed random sample data blocks. It presents a high-level architecture design of the BDMS, which extends current distributed file systems. The system offers block-level management functions such as statistically aware data partitioning, data block organization, and data block selection. The paper also presents a round-random partitioning scheme that represents a big dataset as a set of non-overlapping data blocks, each of which is a random sample of the whole dataset. Based on this scheme, two algorithms are introduced as an implementation strategy for converting the HDFS blocks of a big file into a set of random sample data blocks, which are also stored in HDFS. The experimental results show that the execution time of the partitioning operation is acceptable in real applications because the operation is performed only once on each input data file.
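
The abstract does not spell out the round-random partitioning scheme, so the following is only a minimal in-memory sketch of one way such a scheme could work, not the paper's algorithms. It assumes that each original block is shuffled independently and that its records are then dealt round-robin into the output blocks, so that every output block receives a random slice of every original block. The function name `round_random_partition` and its parameters `num_out` and `seed` are hypothetical, and plain Python lists stand in for HDFS blocks.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def round_random_partition(
    blocks: Sequence[Sequence[T]], num_out: int, seed: int = 0
) -> List[List[T]]:
    """Redistribute the records of the original blocks into num_out new blocks.

    Assumed procedure (hypothetical, not taken from the paper):
      1. Shuffle each original block independently.
      2. Deal its records round-robin into the num_out output blocks.
    Each output block then holds a random slice of every original block,
    and the output blocks do not overlap.
    """
    if num_out < 1:
        raise ValueError("num_out must be at least 1")
    rng = random.Random(seed)
    out: List[List[T]] = [[] for _ in range(num_out)]
    for block in blocks:
        shuffled = list(block)  # copy so the caller's block is left untouched
        rng.shuffle(shuffled)
        # Round-robin dealing keeps per-block contributions within one record
        # of each other across the output blocks.
        for i, record in enumerate(shuffled):
            out[i % num_out].append(record)
    return out


if __name__ == "__main__":
    # Three "original blocks" whose contents are ordered, so no single
    # original block is representative of the whole dataset.
    original = [list(range(0, 100)), list(range(100, 200)), list(range(200, 300))]
    sample_blocks = round_random_partition(original, num_out=4)
    print([len(b) for b in sample_blocks])
```

In this sketch the shuffle happens once per original block, which mirrors the abstract's remark that partitioning is a one-time cost per input file; later analyses could then read any single output block as a sample instead of scanning the whole file.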

Publication date: 2019-02
Affiliation: Shenzhen University

