Addressing Transient and Permanent Faults in NoC With Efficient Fault-Tolerant Deflection Router

Authors: Feng, Chaochao*; Lu, Zhonghai; Jantsch, Axel; Zhang, Minxuan; Xing, Zuocheng
Source: IEEE Transactions on Very Large Scale Integration Systems, 2013, 21(6): 1053-1066.
DOI: 10.1109/TVLSI.2012.2204909

Abstract

The continuing decrease in the feature size of integrated circuits increases their susceptibility to transient and permanent faults. This paper proposes a fault-tolerant solution for a bufferless network-on-chip (NoC) with three parts: an on-line fault-diagnosis mechanism to detect both transient and permanent faults; a link-level error control scheme, combining hybrid automatic repeat request (ARQ) and forward error correction (FEC), to handle transient faults; and a reinforcement-learning-based fault-tolerant deflection routing (FTDR) algorithm to tolerate permanent faults without deadlock or livelock. A hierarchical-routing-table-based algorithm (FTDR-H) is also presented to reduce the area overhead of the FTDR router. Synthesis results show that, compared with the FTDR router, the FTDR-H router reduces area by 27% in an 8 x 8 network. Simulation results demonstrate that under synthetic workloads, in the presence of permanent link faults, the throughput of an 8 x 8 network with the FTDR and FTDR-H algorithms is on average 14% and 23% higher than with the fault-on-neighbor (FoN) aware deflection routing algorithm and the cost-based deflection routing algorithm, respectively. Under real application workloads, the FTDR-H algorithm achieves a 20% lower hop count on average than the FoN algorithm. For transient faults, the performance of the FTDR router degrades gracefully even at a high fault rate. We also implement the fault-tolerant deflection router, which achieves 400 MHz in TSMC 65-nm technology.
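To make the idea of table-driven deflection routing around permanent link faults more concrete, the following is a minimal sketch and not the authors' FTDR implementation. It assumes a generic Q-routing-style table in which each router stores, per destination and per output port, an estimated hop count, refreshed from its neighbors with a learning rate of 1 (which reduces to a distance-vector relaxation). All names (`PORTS`, `healthy_ports`, `learn_tables`, `pick_port`) are hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Set, Tuple

Node = Tuple[int, int]
Link = FrozenSet[Node]

# Output ports of a 2D-mesh router and their coordinate offsets (assumed layout).
PORTS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}


def healthy_ports(node: Node, size: int, faulty: Set[Link]) -> Dict[str, Node]:
    """Return {port: neighbor} for links that exist and are not permanently faulty."""
    x, y = node
    out = {}
    for port, (dx, dy) in PORTS.items():
        nb = (x + dx, y + dy)
        in_mesh = 0 <= nb[0] < size and 0 <= nb[1] < size
        if in_mesh and frozenset((node, nb)) not in faulty:
            out[port] = nb
    return out


def learn_tables(size: int, faulty: Set[Link], sweeps: Optional[int] = None):
    """Build q[node][dest][port] = estimated hops to dest when leaving via port.

    Each sweep applies the update q = 1 + (best estimate held by the neighbor),
    i.e. a Q-routing-style update with learning rate 1 (an assumption of this sketch).
    """
    nodes = [(x, y) for x in range(size) for y in range(size)]
    inf = size * size  # cap used as "unreachable / unknown"
    sweeps = 2 * size * size if sweeps is None else sweeps
    ports = {n: healthy_ports(n, size, faulty) for n in nodes}
    q = {n: {d: {p: inf for p in ports[n]} for d in nodes} for n in nodes}
    for _ in range(sweeps):
        for n in nodes:
            for p, nb in ports[n].items():
                for d in nodes:
                    if d == n:
                        continue
                    best = 0 if nb == d else min(q[nb][d].values(), default=inf)
                    q[n][d][p] = min(inf, 1 + best)
    return q


def pick_port(q, node: Node, dest: Node, busy: Iterable[str] = ()) -> Optional[str]:
    """Pick the free port with the lowest estimate; deflect if the best one is busy."""
    free = {p: h for p, h in q[node][dest].items() if p not in set(busy)}
    return min(free, key=free.get) if free else None


if __name__ == "__main__":
    broken = {frozenset(((0, 0), (1, 0)))}  # one permanent link fault
    table = learn_tables(3, broken)
    print(pick_port(table, (0, 0), (1, 0)), table[(0, 0)][(1, 0)])
```

In this sketch a faulty link simply has no table entry, so packets are never offered that port, and a packet that loses arbitration for its best port is deflected to the next-best free one. The hierarchical table of FTDR-H and the paper's deadlock and livelock guarantees are not modeled here.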