Axo: Detection and Recovery for Delay and Crash Faults in Real-Time Control Systems

Authors: Mohiuddin Maaz; Saab Wajeb*; Bliudze Simon; Le Boudec Jean Yves
Source: IEEE Transactions on Industrial Informatics, 2018, 14(7): 3065-3075.
DOI: 10.1109/TII.2017.2772219

Abstract

Real-time control systems use controllers that compute and issue setpoints within stringent delay constraints. Failure to do so, whether through a crash or a delay caused by software and/or hardware faults, can cause failure of the controlled resources. Recently, Axo, a protocol for masking crash and delay faults by replicating the controller, was proposed. Axo provides safety by discarding delayed setpoints, and it relies on the presence of valid setpoints for providing availability. To ensure that enough valid setpoints are issued, faulty controller replicas need to be detected and recovered. We present mechanisms for detection and recovery of delay- and crash-faulty replicas under the Axo framework. These mechanisms were designed to be soft state (i.e., their state can be reconstructed from received messages) to enable seamless addition of new replicas. Besides presenting the design, we analytically characterize the time to detect and recover a faulty replica, and we validate these characterizations experimentally. We demonstrate the performance of Axo by using two case studies: the first provides a stability analysis of an inverted pendulum system with Axo, and the second shows the fault-tolerance performance of Axo through a deployment on a real-time control system that controls a CIGRE low-voltage benchmark microgrid.
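
The two ideas named in the abstract, discarding delayed setpoints and keeping detection state that can be rebuilt from received messages, can be illustrated with a minimal sketch. This is not the Axo protocol as published: the class names, the `validity` and `detection_window` parameters, and the silence-based suspicion rule are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Setpoint:
    replica_id: str
    issued_at: float  # time (s) of the measurement the setpoint is based on
    value: float


@dataclass
class Monitor:
    # Hypothetical parameters, not taken from the paper.
    validity: float          # maximum age (s) before a setpoint is discarded
    detection_window: float  # silence (s) after which a replica is suspected
    # Soft state: populated only from received messages, so a restarted
    # monitor or a newly added replica needs no explicit state transfer.
    last_valid: dict = field(default_factory=dict)

    def accept(self, sp: Setpoint, now: float) -> bool:
        """Return True for a valid setpoint; discard a delayed one."""
        if now - sp.issued_at > self.validity:
            return False  # too old: discarding it preserves safety
        self.last_valid[sp.replica_id] = now
        return True

    def suspected(self, now: float) -> list:
        """Replicas whose last valid setpoint is older than the window."""
        return sorted(
            rid for rid, seen in self.last_valid.items()
            if now - seen > self.detection_window
        )


monitor = Monitor(validity=0.02, detection_window=0.1)
monitor.accept(Setpoint("a", issued_at=0.0, value=1.0), now=0.01)   # accepted
monitor.accept(Setpoint("b", issued_at=0.0, value=1.0), now=0.05)   # discarded
monitor.accept(Setpoint("b", issued_at=0.05, value=1.0), now=0.06)  # accepted
```

In this sketch a replica that only ever sends delayed setpoints never enters `last_valid`, so a fuller design would also need a way to learn the expected replica set; that part is left out here.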

  • Publication date: 2018-07