Abstract

A MapReduce framework abstracts distributed-system concerns and integrates a distributed file system with an application's needs. However, nondeterminism in distributed system components and an unreliable network may cause application errors that are difficult to identify, locate, and correct. This paper presents a method for creating a set of fault cases derived from a Petri net (PN), and a framework that automates the execution of these fault cases in a distributed system. The framework controls each MapReduce component and injects faults according to the component's state. Experimental results showed that the fault cases are representative for testing Hadoop, a MapReduce implementation. We tested three versions of Hadoop and identified bugs and elementary behavioral differences between the versions. As a byproduct, the method contributes to network reliability, because it identifies errors caused by a service or system bug instead of simply attributing them to the network.

  • Publication date: 2015-7-5