摘要

Nowadays, hardware reliability is considered a first-class issue along with performance and energy efficiency. The increasing scaling technology and subsequent supply voltage reductions, together with temperature fluctuations, augment the susceptibility of architectures to errors. %26lt;br%26gt;With the development of CMPs, the interest for using parallel applications has increased. Previous proposals for providing fault detection and recovery have been mainly based on redundant execution over different cores. RMT (Redundant Multi-Threading) is a family of techniques based on SMT (Simultaneous Multi-Threading) processors in which two independent threads (master and slave), fed with the same inputs, redundantly execute the same instructions, in order to detect faults by checking their outputs. In this paper, we study the under-explored architectural support of RMT techniques to reliably execute shared-memory applications in tiled-CMPs. %26lt;br%26gt;Initially, we show how atomic operations induce serialization points between master and slave threads, degrading the execution time by 35% for several parallel scientific and multimedia benchmarks. To address this issue, we introduce REPAS (Reliable Execution of Parallel ApplicationS in tiled-CMPs), a novel RMT mechanism to provide reliable execution in shared-memory applications in environments prone to transient faults. REPAS architecture only needs few extra hardware since the redundant execution is performed within 2-way SMT cores in which the majority of hardware is shared. Experimental results show that REPAS is able to provide fault tolerance against soft errors with a lower execution time overhead (around 25% including the cost of redundancy) in comparison to a non-redundant system than previous proposals while using less hardware resources. Additionally, we show that REPAS supports huge fault ratios with negligible impact on performance (less than 2% for a fault ratio of 100 faults per million cycles).

  • 出版日期2012-9

全文