A simplified reliability analysis method for cloud computing systems considering common-cause failures

Authors: Li, Ruiying*; Li, Qiong; Huang, Ning; Kang, Rui
Source: Proceedings of the Institution of Mechanical Engineers - Part O: Journal of Risk and Reliability, 2017, 231(3): 324-333.
DOI: 10.1177/1748006X17703863

Abstract

Virtualization, one of the main features of cloud computing systems, enables multiple virtual machines to be built on a single server. However, this feature introduces a new challenge in reliability modeling: the failure of a server makes all of its co-located virtual machines inoperable, which is a typical common-cause failure. To reflect the demand placed on a cloud computing system, the reliability of the system is defined as the probability that at least a given number of virtual machines are operable. State-space enumeration is one way to calculate this reliability; however, because of the large number of state combinations, it is time-consuming and impractical. To solve this problem, we propose a simplified reliability analysis method based on fault tree and state-space models. Two illustrative examples are studied to show the process and the effectiveness of our method. State enumeration and Monte Carlo simulation are also used as back-to-back verifications to confirm the correctness of our method. The results differ considerably from those of a reliability analysis that does not consider common-cause failures, which illustrates the necessity of considering common-cause failures when assessing the reliability of cloud computing systems.
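
The reliability definition in the abstract (the probability that at least a given number of virtual machines are operable) can be illustrated with a small sketch. This is not the authors' fault tree and state-space method; it is a simple conditioning calculation under assumptions that do not come from the paper: each server and each virtual machine fails independently with a fixed probability, and a virtual machine is operable only if both it and its host server are up. The function names, the `(n_vms, p_server, p_vm)` tuple layout, and the numeric values are all hypothetical. The sketch also includes a brute-force state enumeration as a cross-check and a variant that ignores the common cause, so the gap between the two can be seen.

```python
from itertools import product
from math import comb


def vm_count_pmf(n_vms, p_server, p_vm):
    """PMF of the number of operable VMs on one server (assumed model)."""
    pmf = [0.0] * (n_vms + 1)
    pmf[0] += 1.0 - p_server  # server down: all co-located VMs are lost
    for j in range(n_vms + 1):
        # server up: the VMs fail independently (binomial)
        pmf[j] += p_server * comb(n_vms, j) * p_vm**j * (1 - p_vm) ** (n_vms - j)
    return pmf


def convolve(a, b):
    """PMF of the sum of two independent counts."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def reliability_with_ccf(servers, k):
    """P(at least k VMs operable), with server failure as a common cause."""
    total = [1.0]
    for n_vms, p_server, p_vm in servers:
        total = convolve(total, vm_count_pmf(n_vms, p_server, p_vm))
    return sum(total[k:])


def reliability_ignoring_ccf(servers, k):
    """Same quantity, but wrongly treating every VM as independent,
    each with reliability p_server * p_vm."""
    total = [1.0]
    for n_vms, p_server, p_vm in servers:
        p = p_server * p_vm
        for _ in range(n_vms):
            total = convolve(total, [1 - p, p])
    return sum(total[k:])


def reliability_by_enumeration(servers, k):
    """Brute-force check: enumerate every up/down state of servers and VMs."""
    comps = []
    for n_vms, p_server, p_vm in servers:
        comps.append(p_server)
        comps.extend([p_vm] * n_vms)
    total = 0.0
    for state in product([0, 1], repeat=len(comps)):
        prob = 1.0
        for s, p in zip(state, comps):
            prob *= p if s else 1 - p
        up, idx = 0, 0
        for n_vms, _, _ in servers:
            # VMs count only when their host server is up
            up += state[idx] * sum(state[idx + 1 : idx + 1 + n_vms])
            idx += n_vms + 1
        if up >= k:
            total += prob
    return total


if __name__ == "__main__":
    # Hypothetical system: 2 servers, 2 VMs each, at least 3 VMs required.
    servers = [(2, 0.9, 0.95), (2, 0.9, 0.95)]
    print(reliability_with_ccf(servers, 3))
    print(reliability_by_enumeration(servers, 3))
    print(reliability_ignoring_ccf(servers, 3))
```

The enumeration visits 2^(servers + VMs) states, which is why it becomes impractical for realistic system sizes, whereas the conditioning approach grows only polynomially with the number of virtual machines under these simplified assumptions.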