A software scheduling solution to avoid corrupted units on GPUs

Defour David; Petit Eric<sup>*</sup>

doi:10.1016/j.jpdc.2016.01.001

Abstract

Massively parallel processors achieve high computing performance by increasing the number of concurrent execution units. At the same time, transistor technology is evolving toward higher density, higher frequency, and lower voltage. The combination of these factors significantly increases the probability of hardware failures. In this paper, we present a methodology to locate and mitigate hardware failures on Nvidia GPUs. Results show that intermittent errors can be precisely localized and that their impact is limited to a well-defined architecture tile. We therefore propose, and demonstrate with a software prototype, a rescheduling strategy that quarantines the defective hardware and ensures correct execution. Our approach significantly improves GPU fault tolerance and extends GPU lifespan at a reasonable overhead.

Publication date: 2016-04


