An I/O-efficient and adaptive fault-tolerant framework for distributed graph computations

Authors: Wang, Zhigang; Gu, Yu; Bao, Yubin; Yu, Ge; Gao, Lixin
Source: Distributed and Parallel Databases, 2017, 35(2): 177-196.
DOI: 10.1007/s10619-017-7192-2

Abstract

In recent years, many large-scale iterative graph computation systems, such as Pregel, have been developed. To make these systems fault-tolerant, checkpointing, which periodically archives graph states onto distributed file systems, has been proposed. However, fault tolerance remains challenging because the whole dataset is archived at a static interval, so the underlying graph computations incur I/O costs in terms of disk and network communication. Motivated by this, we first propose to dynamically adjust checkpoint intervals based on a carefully designed cost-analysis model that takes the underlying computing workload into account. Furthermore, for algorithms that can be restarted from any point during computation, we prioritize graph states so that checkpointing can be performed on selected data instead of the entire dataset, reducing archiving overhead while still guaranteeing failure recovery efficiency. Finally, we conduct extensive performance studies on a broad spectrum of real-world graphs to confirm the effectiveness of our approaches over existing up-to-date solutions.