Abstract

Stream computing systems are designed for high-frequency data and, in real deployments, can handle billions of transactions per day. Cloud technology supports distributed stream computing through its elasticity and fault tolerance. In real deployment environments, such as the pre-treatment systems of top Chinese banks, reliability as experienced by users is a key performance metric. Although many significant approaches have been proposed in the literature, they have limitations such as a lack of architectural focus or difficulty of implementation in complex projects. This paper describes the reliability problem caused by service degradation in the cloud. We combine novel reliability analysis techniques, queuing theory, and software rejuvenation management to build a framework that supports stream data with low latency and fault tolerance. A real streaming system from a top bank is studied to provide supporting data. Operational parameters such as the rejuvenation window and the time-out parameter are identified as key to the design of a distributed stream processing system. An algorithm for reliability optimization, monitoring, and forecasting is also introduced. Finally, the paper compares the improved results with the original problems; the improvements saved millions in costs and protected the bank's reputation.