Algorithm-Based Recovery for HPL

Authors: Davies Teresa*; Chen Zizhong; Karlsson Christer; Liu Hui
Source: ACM SIGPLAN Notices, 2011, 46(8): 303-304.
DOI: 10.1145/2038037.1941600

Abstract

When more processors are used for a calculation, the probability that one will fail during the calculation increases. Fault tolerance is a technique for allowing a calculation to survive a failure, and includes recovering lost data. A common method of recovery is diskless checkpointing. However, it has high overhead when a large amount of data is involved, as is the case with matrix operations. A checksum-based method allows fault tolerance of matrix operations with lower overhead. This technique is applicable to the LU decomposition in the benchmark HPL.
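The checksum idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the simplest possible setting: each process holds one equally sized block of data, an extra block stores the element-wise sum of all of them, and exactly one block is lost. The function names `make_checksum` and `recover_block` are hypothetical and chosen for illustration only; this is not the paper's HPL implementation, which is not described in the abstract.

```python
def make_checksum(blocks):
    """Return the element-wise sum of equally sized blocks (lists of numbers)."""
    return [sum(column) for column in zip(*blocks)]


def recover_block(blocks, checksum, lost):
    """Rebuild blocks[lost] from the checksum and the surviving blocks.

    Assumption: exactly one block is lost and the checksum is still intact.
    """
    survivors = [b for i, b in enumerate(blocks) if i != lost]
    partial = make_checksum(survivors)
    # lost block = checksum - (sum of all surviving blocks)
    return [c - s for c, s in zip(checksum, partial)]


# Three "processes", each holding one block of data (illustrative values).
blocks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
checksum = make_checksum(blocks)

# Pretend process 1 failed; its block is rebuilt without reading blocks[1].
recovered = recover_block(blocks, checksum, lost=1)
```

In this sketch, recovery costs one extra block of storage rather than a full copy of every block, which gives a rough sense of why a checksum can be cheaper than checkpointing all of the data.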

  • Publication date: 2011-8
