A Concurrent Partial Snapshot Algorithm for Large-Scale and Dynamic Distributed Systems

Authors: Kim Yonghwan*; Araragi Tadashi; Nakamura Junya; Masuzawa Toshimitsu
Source: IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2014, E97D(1): 65-76.
DOI: 10.1587/transinf.E97.D.65

Abstract

Checkpoint-rollback recovery, a universal method for restoring distributed systems after faults, requires a sophisticated snapshot algorithm, especially in large-scale systems, since repeatedly taking global snapshots of the whole system incurs an unacceptable communication cost. One such approach is a partial snapshot algorithm, which, instead of taking a global snapshot of the whole system, takes a snapshot of a subsystem consisting only of the nodes that are communication-related to the initiator. In this paper, we modify the previous partial snapshot algorithm to create a new one that takes a partial snapshot more efficiently, especially when multiple nodes concurrently initiate the algorithm. Experiments show that the proposed algorithm greatly reduces the amount of communication needed for taking partial snapshots.

  • Publication date: 2014-1