A new parallel recomputing code design methodology for fast failure recovery

Authors: Du, Yunfei*; Tang, Yuhua; Xie, Xinwei
Source: Computers & Electrical Engineering, 2013, 39(4): 1095-1113.
DOI: 10.1016/j.compeleceng.2013.01.010

Abstract

As large-scale computer systems grow, their mean time between failures (MTBF) is becoming significantly shorter than the execution time of many current scientific applications. The fault-tolerant parallel algorithm (FTPA) is an application-level fault-tolerance approach that achieves fast self-recovery through parallel recomputing. In our prior work, we designed the parallel recomputing code for FTPA by parallelizing loops. In this paper, we first propose a new methodology for designing parallel recomputing code. Second, we automate this methodology using compiler technology. Finally, we evaluate the performance of our approach with five programs on Tianhe-1A. The experimental results show that the parallel recomputing code generated by the new method achieves higher parallel recomputing efficiency.