DCatch: Automatically Detecting Distributed Concurrency Bugs in Cloud Systems

Authors: Liu Haopeng*; Li Guangpu; Lukman Jeffrey F; Li Jiaxin; Lu Shan; Gunawi Haryadi S; Tian Chen
Source: ACM SIGPLAN Notices, 2017, 52(4): 677-691.
DOI: 10.1145/3037697.3037735

Abstract

In the era of big data and cloud computing, the reliability of distributed systems is extremely important. Unfortunately, distributed concurrency bugs, referred to as DCbugs, are widespread. They hide in the large state space of distributed cloud systems and manifest non-deterministically, depending on the timing of distributed computation and communication. Effective techniques to detect DCbugs are therefore needed. This paper presents DCatch, a pilot solution for DCbug detection. DCatch predicts DCbugs by analyzing correct executions of distributed systems. To build DCatch, we design a set of happens-before rules that model a wide variety of communication and concurrency mechanisms in real-world distributed cloud systems. We then build run-time tracing and trace analysis tools to effectively identify concurrent conflicting memory accesses in these systems. Finally, we design tools to help prune false positives and trigger DCbugs. We have evaluated DCatch on four representative open-source distributed cloud systems: Cassandra, Hadoop MapReduce, HBase, and ZooKeeper. By monitoring correct executions of seven workloads on these systems, DCatch reports 32 DCbugs, 20 of which are truly harmful.
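
The idea of finding "concurrent conflicting memory accesses" under happens-before rules can be illustrated with a small sketch. This is not DCatch's implementation: the event format, the thread and variable names, and the two edge kinds used here (program order within a handler, and message send before receive) are assumptions chosen for illustration. The abstract states only that DCatch models a wider variety of mechanisms. The sketch builds a happens-before graph over a recorded trace and reports pairs of accesses to the same variable, at least one of them a write, that are not ordered in either direction.

```python
from itertools import combinations

# Hypothetical trace of one correct run: (thread, kind, target).
# Names such as "n2.handler" and "jobState" are made up for this example.
EVENTS = [
    ("n1.main",    "send",  "m1"),        # 0
    ("n2.handler", "recv",  "m1"),        # 1
    ("n2.handler", "write", "jobState"),  # 2
    ("n2.cleaner", "read",  "jobState"),  # 3
    ("n2.handler", "read",  "jobState"),  # 4
]

# Assumed happens-before edges (earlier event -> later event):
# message send before its receive, and program order inside n2.handler.
HB_EDGES = [(0, 1), (1, 2), (2, 4)]


def reachability(edges, n):
    """Return reach[i] = set of events that event i happens before."""
    succ = [set() for _ in range(n)]
    for a, b in edges:
        succ[a].add(b)
    reach = [set() for _ in range(n)]
    for start in range(n):
        stack = list(succ[start])
        while stack:
            node = stack.pop()
            if node not in reach[start]:
                reach[start].add(node)
                stack.extend(succ[node])
    return reach


def find_candidates(events, hb_edges):
    """Report unordered pairs of conflicting accesses as candidate bugs."""
    reach = reachability(hb_edges, len(events))
    candidates = []
    for i, j in combinations(range(len(events)), 2):
        _, kind_i, target_i = events[i]
        _, kind_j, target_j = events[j]
        if kind_i not in ("read", "write") or kind_j not in ("read", "write"):
            continue  # only memory accesses can conflict
        same_target = target_i == target_j
        has_write = "write" in (kind_i, kind_j)
        ordered = j in reach[i] or i in reach[j]
        if same_target and has_write and not ordered:
            candidates.append((i, j))
    return candidates


if __name__ == "__main__":
    for i, j in find_candidates(EVENTS, HB_EDGES):
        print("candidate:", EVENTS[i], "vs", EVENTS[j])
```

In this made-up trace, the write in event 2 and the read in event 3 have no happens-before path between them, so the pair is reported as a candidate. Events 2 and 4 are ordered by program order and are not reported. A candidate from such an analysis is only a prediction; the abstract describes separate tools for pruning false positives and triggering the bug.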

  • Publication date: 2017-04