ApproxSSD: Data Layout Aware Sampling on an Array of SSDs

Zhou, Jian<sup>*</sup>; Wu, Huafeng; Wang, Jun

doi:10.1109/TC.2018.2871116

Abstract

Running analytic frameworks on sampled data sets is a current trend in response to growing data sizes and demand for real-time analysis. Additionally, high-performance, energy-efficient Solid-State Drive (SSD) arrays are the primary storage subsystem for parallel data analysis systems. To exploit the benefits of SSD arrays when analyzing sampled data sets, several key areas must be considered. First, because of logical-to-physical address translation, random data choice in data sampling jobs can cause unbalanced workloads among the SSDs in the array. Second, after the data choice, existing task schedulers in data analysis frameworks can introduce non-negligible resource contention resulting from suboptimal Input/Output (I/O). The performance of SSDs is unpredictable because of their varying maintenance costs at runtime, which makes them hard for the scheduler to manage. With the trend towards sample-set data analytics and the use of SSDs, it is increasingly important to ensure balanced workloads and minimize resource contention. Unless these areas are addressed, sample-set data analytics on SSDs will continue to suffer from performance inefficiencies. In this paper, we propose ApproxSSD to perform on-disk layout-aware data sampling on SSD arrays. The proposed framework leverages data selection and task scheduling to improve the performance of many applications. ApproxSSD decouples I/O from computation in task execution, which avoids potential I/O contention and suboptimal workload balance. We have developed an open-source prototype of ApproxSSD in Scala, available on GitHub. Our evaluation shows that, compared to Spark, ApproxSSD achieves a speedup of up to 2.7 times at a 10 percent sampling ratio under an example sampling workload while maintaining high output accuracy.
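
The workload imbalance described in the abstract can be illustrated with a small sketch. This is not the ApproxSSD algorithm, which the abstract does not spell out; it is a plain per-device stratified sampler written under loud assumptions: blocks are striped round-robin across the SSDs, the block-to-device mapping is known to the sampler, and all function names (`device_of`, `uniform_sample`, `layout_aware_sample`, `load_per_device`) are hypothetical. It contrasts uniform random block choice, which can load devices unevenly, with a choice that gives every device an almost equal share of the sample.

```python
import random
from collections import Counter


def device_of(block_id, num_devices):
    # Assumption: simple round-robin striping, so block i lives on device i mod N.
    return block_id % num_devices


def uniform_sample(num_blocks, ratio, rng):
    # Layout-oblivious baseline: pick blocks uniformly at random.
    k = round(num_blocks * ratio)
    return rng.sample(range(num_blocks), k)


def layout_aware_sample(num_blocks, num_devices, ratio, rng):
    # Group blocks by device, then draw a near-equal quota from each group.
    k = round(num_blocks * ratio)
    by_device = [[] for _ in range(num_devices)]
    for block_id in range(num_blocks):
        by_device[device_of(block_id, num_devices)].append(block_id)

    base, extra = divmod(k, num_devices)
    chosen = []
    for dev, blocks in enumerate(by_device):
        # The first `extra` devices take one more block so quotas sum to k.
        quota = min(base + (1 if dev < extra else 0), len(blocks))
        chosen.extend(rng.sample(blocks, quota))
    return chosen


def load_per_device(blocks, num_devices):
    # Number of sampled blocks each device has to serve.
    counts = Counter(device_of(b, num_devices) for b in blocks)
    return [counts.get(dev, 0) for dev in range(num_devices)]


if __name__ == "__main__":
    rng = random.Random(0)
    naive = uniform_sample(1000, 0.1, rng)
    aware = layout_aware_sample(1000, 8, 0.1, rng)
    print("uniform      :", load_per_device(naive, 8))
    print("layout-aware :", load_per_device(aware, 8))
```

In this sketch the layout-aware variant keeps per-device load within one block of each other by construction, whereas the uniform variant depends on the random draw. A real SSD array would also have to account for the runtime maintenance costs the abstract mentions, which this example ignores.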

Publication date: 2019-4
Affiliation: Shanghai Maritime University
