Abstract

Cloud site failures and the variable performance of virtual machines (VMs) in this environment are two issues that software providers must take into account. To guarantee that results are returned to customers on time, their virtual infrastructure must be designed to adapt when conditions change. This is especially critical for compute-intensive applications that run on virtual clusters with a large number of VMs, because they can need hours or days to produce valid results. Performance changes can mean longer times to produce results and, probably, higher costs. Site failures usually force a restart from the beginning, losing many computing hours. In this paper we present a fault-tolerant virtual cluster architecture that tackles both issues in the context of compute-intensive bag-of-tasks applications. It includes an Elasticity Engine that uses application performance to decide whether to enlarge or shrink the virtual cluster in order to meet the expectations of the final users. The architecture has been tested in three experiments: an execution of the application in a multi-site configuration, which showed that it suffers no penalty from running in a distributed environment; a Specific Deadline Objective experiment, in which the Elasticity Engine decides to enlarge the cluster with new VMs so that the simulation ends on time; and a fault-tolerance test in which one part of a distributed virtual cluster is lost and application performance is restored on the surviving Cloud site using recovery mechanisms and elasticity rules, without interruption of the service.

  • Publication date: 2014-5