摘要

In this work, we are interested in the problem of task scheduling on large-scale data-intensive computing systems. In order to achieve good performance, one must construct not only good task schedules but also good data allocation across nodes on the system, since before a task can be executed, it must have access to data distributed on the system. In this article, we present a general formulation of a static problem that combines both scheduling and replication problems in data-intensive distributed systems. We show that this problem does not admit an approximation algorithm. However, considering a restricted version of the problem that considers some practical constraints, an approximation algorithm can be designed. From a practical perspective, we introduce a novel heuristic for the problem that is based on nodes clustering. We compare the heuristic with two adapted approaches from other works in the literature by computational simulations using an extensive set of instances based on real computer grids. We show that our heuristic often obtains the best solutions and also runs faster than other approaches.

  • 出版日期2018-9