A Fault-Tolerant Framework for Asynchronous Iterative Computations in Cloud Environments

Authors: Wang, Zhigang*; Gao, Lixin; Gu, Yu; Bao, Yubin; Yu, Ge
Source: IEEE Transactions on Parallel and Distributed Systems, 2018, 29(8): 1678-1692.
DOI: 10.1109/TPDS.2018.2808519

Abstract

Most graph algorithms are iterative in nature and can be processed efficiently by distributed in-memory systems in an asynchronous manner. However, recovering from failures in such systems is challenging, because traditional checkpoint-based fault-tolerant frameworks incur expensive barrier costs that usually offset the gains brought by asynchronous computation. Worse, surviving data are rolled back, leading to costly recomputation. This paper first proposes to leverage surviving data for failure recovery in an asynchronous system. Our framework guarantees the correctness of algorithms and avoids rolling back surviving data. Additionally, a novel asynchronous checkpointing solution is introduced to accelerate recovery at nearly zero overhead. Optimization strategies such as message pruning, non-blocking recovery, and load balancing are also designed to further boost performance. We have conducted extensive experiments on real-world graphs to show the effectiveness of our proposals.