Fault-Tolerant Dynamic Rescheduling for Heterogeneous Computing Systems

Mei, Jing; Li, Kenli; Zhou, Xu; Li, Keqin

doi:10.1007/s10723-015-9331-1

Abstract

As the scale and complexity of heterogeneous computing systems grow, failures occur frequently and adversely affect the solving of large-scale applications. Hence, fault-tolerant scheduling is an imperative step for large-scale computing systems. Existing fault-tolerant scheduling algorithms are static: they allocate multiple copies of each task to several processors regardless of whether processor failures actually affect the execution of tasks. Such active replication strategies not only waste resources but also sacrifice makespan. Moreover, they cannot guarantee the successful execution of applications. In this paper, we propose a fault-tolerant dynamic rescheduling algorithm named FTDR, which overcomes the above drawbacks. FTDR continuously monitors for processor failures and reschedules the suspended tasks once a failure occurs. Because FTDR reschedules only the tasks that are suspended because of failures, it can tolerate an arbitrary number of failures. Randomly generated DAGs are tested in our experiments. Experimental results show that the proposed algorithm achieves good performance in terms of makespan and resource consumption compared with its direct competitors.
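
The reschedule-on-failure idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' FTDR algorithm, whose details are not given here. It assumes a simple greedy policy in which each task stranded on a failed processor moves to the surviving processor that would finish it earliest. The function name `reschedule_suspended` and the parameters `assignment`, `cost`, and `ready_time` are hypothetical, and DAG precedence constraints are ignored for brevity.

```python
from typing import Dict


def reschedule_suspended(
    assignment: Dict[str, str],
    cost: Dict[str, Dict[str, float]],
    ready_time: Dict[str, float],
    failed: str,
) -> Dict[str, str]:
    """Reassign tasks from a failed processor (illustrative policy, not FTDR).

    assignment: task -> processor currently responsible for it
    cost:       task -> processor -> execution time on that processor
    ready_time: processor -> time at which it becomes free
    failed:     name of the processor that has just failed
    """
    survivors = [p for p in ready_time if p != failed]
    if not survivors:
        raise RuntimeError("no surviving processor to reschedule onto")

    new_assignment = dict(assignment)
    # Track when each surviving processor becomes free as tasks are added.
    avail = {p: ready_time[p] for p in survivors}

    # Only tasks on the failed processor are touched; others keep their place.
    suspended = sorted(t for t, p in assignment.items() if p == failed)
    for task in suspended:
        # Assumed greedy rule: earliest finish time, ties broken by name.
        best = min(survivors, key=lambda p: (avail[p] + cost[task][p], p))
        avail[best] += cost[task][best]
        new_assignment[task] = best
    return new_assignment


if __name__ == "__main__":
    assignment = {"t1": "p1", "t2": "p2", "t3": "p1"}
    cost = {
        "t1": {"p1": 2.0, "p2": 3.0, "p3": 4.0},
        "t2": {"p1": 2.0, "p2": 3.0, "p3": 4.0},
        "t3": {"p1": 1.0, "p2": 5.0, "p3": 2.0},
    }
    ready_time = {"p1": 0.0, "p2": 3.0, "p3": 0.0}
    print(reschedule_suspended(assignment, cost, ready_time, failed="p1"))
```

Because the function acts only on the tasks stranded by a failure, it can be called again after each subsequent failure, as long as at least one processor survives. No replicas are allocated in advance, in contrast to the active replication strategies the abstract criticizes.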

Publication date: 2015-12
Affiliation: Hunan University

