Abstract

Reliability is a critical metric in the design and development of replication-based big data storage systems such as the Hadoop Distributed File System (HDFS). In a system with thousands of machines and storage devices, even infrequent failures become likely. In the Google File System, the annual disk failure rate is 2.88%, which means one would expect to see 8,760 disk failures per year. Unfortunately, given an increasing number of node failures, how often a cluster starts losing data as it scales out has not been well investigated. Moreover, there is no systematic method to quantify the reliability of multi-way replication-based data placement schemes, which have been widely adopted in enterprise large-scale storage systems to improve I/O parallelism. In this paper, we develop a new reliability model that incorporates the probability of replica loss to investigate the system reliability of multi-way declustering data layouts and analyze their potential for parallel recovery. Our comprehensive simulation results from MATLAB and SHARPE show that the shifted declustering data layout outperforms the random declustering layout in a multi-way replication scale-out architecture, reducing data loss probability and improving system reliability by up to 63% and 85%, respectively. Our study of both 5-year and 10-year system reliability under various recovery bandwidth settings shows that the shifted declustering layout surpasses the two baseline approaches in both cases, consuming up to 79% and 87% less recovery bandwidth than the copyset layout and 4.8% and 10.2% less than the random layout.
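As a rough sanity check on the failure figures cited above, the fleet size implied by the 2.88% annual failure rate and 8,760 failures per year can be back-calculated as follows; the disk count is an inference from these two numbers, not a figure reported in the source:

$$
N_{\text{disks}} \approx \frac{8{,}760\ \text{failures/year}}{0.0288\ \text{failures/(disk}\cdot\text{year)}} \approx 3.04 \times 10^{5}\ \text{disks}
$$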