Abstract

Resource management and fault tolerance are central issues in grid computing. In a real grid environment, the number of application tasks and the amount of resources they require are enormous, and users expect quick responses, so real-time resource co-allocation may need to operate at large scale. This paper proposes an Active Grid Information Server (AGIS), a resource manager that provides optimal resource selection and a fault tolerance service using a database management system that supports event-condition-action (ECA) rules. The resource manager automatically selects, from the idle resources, the set that achieves the best performance, with turnaround time as the performance metric. The probability of failure is typically higher in grid computing than in traditional parallel computing, and a resource failure can be fatal to job execution, so a fault tolerance service is essential in computational grids. Grid services are also often expected to meet minimum levels of Quality of Service (QoS). To address this, we propose a fault tolerance service that satisfies QoS requirements. Fault tolerance requires timely notification of changes, which raises the need for mechanisms to monitor and process such changes; ECA rules are a natural candidate to fulfill this need. We develop conservative tests for determining the termination and confluence of sets of ECA rules. We argue that employing ECA rules for both resource selection and fault tolerance leads to efficiency and to additional techniques. Furthermore, the proposed AGIS architecture benefits from the performance and scalability that can be achieved using active databases.
Our preliminary performance results indicate that the ECA rule-based approach to resource matching is fast and accurate and can keep up with high job-arrival rates, an important criterion for online resource matching systems. We describe Grid-JQA, an architecture supporting such rules in grid environments, and our current implementation of it. Three heuristic approaches were designed and compared via simulation; they match tasks to resources while taking into account the QoS requested by the tasks and, at the same time, minimizing the makespan of the tasks as far as possible. An optimum method based on the performance metric was also designed as a baseline for comparing the heuristics. Our proposed solution achieves at least a 45% improvement over the general method, which uses a first come, first served (FCFS) strategy. The implementation and simulation results indicate that our approaches are promising: the resource manager finds the optimal set of resources to guarantee efficient job execution, the fault manager guarantees that submitted jobs are completed, and job duplication improves job execution even if some failures occur.
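
As a rough illustration of how ECA rules can drive both resource selection and fault handling, the following minimal Python sketch may help. It is not the AGIS or Grid-JQA implementation: the names `Resource`, `Grid`, `Rule`, `fire`, the event names, and the per-resource `est_turnaround` estimate are all hypothetical, and an in-memory dictionary stands in for the active database.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Resource:
    name: str
    est_turnaround: float  # hypothetical estimated turnaround time (seconds)
    idle: bool = True
    alive: bool = True


@dataclass
class Grid:
    resources: Dict[str, Resource]
    assignment: Dict[str, str] = field(default_factory=dict)  # job -> resource


@dataclass
class Rule:
    event: str                                # E: event name that triggers the rule
    condition: Callable[[Grid, dict], bool]   # C: predicate over the current state
    action: Callable[[Grid, dict], None]      # A: state change to perform


def select_resource(grid: Grid) -> Optional[Resource]:
    """Pick the idle, live resource with the smallest estimated turnaround."""
    idle = [r for r in grid.resources.values() if r.idle and r.alive]
    return min(idle, key=lambda r: r.est_turnaround, default=None)


def assign(grid: Grid, job: str) -> None:
    """Assign the job to the best idle resource, if there is one."""
    best = select_resource(grid)
    if best is not None:
        best.idle = False
        grid.assignment[job] = best.name


def on_failure(grid: Grid, payload: dict) -> None:
    """Mark the resource as failed and re-run selection for its jobs."""
    failed = payload["resource"]
    grid.resources[failed].alive = False
    for job in [j for j, r in grid.assignment.items() if r == failed]:
        del grid.assignment[job]  # stays unassigned if nothing is idle
        assign(grid, job)


# Illustrative rule set: one rule for selection, one for fault handling.
RULES = [
    Rule("job_submitted",
         lambda g, p: select_resource(g) is not None,
         lambda g, p: assign(g, p["job"])),
    Rule("resource_failed",
         lambda g, p: p["resource"] in g.resources,
         on_failure),
]


def fire(grid: Grid, event: str, payload: dict) -> None:
    """Evaluate every rule registered for the event; run actions whose condition holds."""
    for rule in RULES:
        if rule.event == event and rule.condition(grid, payload):
            rule.action(grid, payload)
```

In this sketch, firing `job_submitted` places a job on the idle resource with the lowest estimated turnaround, and firing `resource_failed` moves the affected jobs to the next-best idle resource. A database-backed version would express the same rules as triggers; that mapping is an assumption here, not something the abstract specifies.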

  • Publication date: 2008-4