Analysis and Diagnosis of SLA Violations in a Production SaaS Cloud

作者:Di Martino Catello*; Sarkar Santonu; Ganesan Rajeshwari; Kalbarczyk Zbigniew T; Iyer Ravishankar K
来源:IEEE Transactions on Reliability, 2017, 66(1): 54-75.
DOI:10.1109/TR.2016.2635033

摘要

service as per its stated service-level agreements (SLAs). While SLA violations in a SaaS platform have been reported, not much work has been done to empirically characterize failures of SaaS. In this paper, we study SLA violations of a production SaaS platform, diagnose the causes, unearth several critical failure modes, and then, suggest various solution approaches to increase the availability of the platform as perceived by the end user. Our approach combines field failure data analysis (FFDA) and fault injection. Our study is based on 283 days of operational logs of the platform. During this time, the platform received business workload from 42 customers spread over 22 countries. We have first developed a set of home-grown FFDA tools to analyze the log, and second implemented a fault injector to automatically inject several runtime errors in the application code written in. NET/C#, and then, collate the injection results. We summarize our finding as: first, system failures have caused 93% of all SLA violations; second, our fault injector has been able to recreate a few cases of bursts of SLA violations that could not be diagnosed from the logs; and third, the fault injection mechanism could recreate several error propagation paths leading to data corruptions that the failure data analysis could not reveal. Finally, the paper presents some system-level implication of this study and how the joint use of fault injection and log analysis may help in improving the reliability of the measured platform.

  • 出版日期2017-3