A failure index for HPC applications

Authors: Paun Andrei*; Chandler Clayton; Leangsuksun Chokchai Box; Paun Mihaela
Source: Journal of Parallel and Distributed Computing, 2016, 93-94: 146-153.
DOI: 10.1016/j.jpdc.2016.04.009

Abstract

This paper examines log files originating from High Performance Computing (HPC) applications with known reliability problems. The results of this study further the maturation and adoption of meaningful metrics representing HPC system and application failure characteristics. Quantifiable metrics representing the reliability of HPC applications are foundational for building an application resilience methodology, which is critical to the realization of exascale supercomputing. In this examination, statistical inequality methods originating from the study of economics are applied to health and status information contained in HPC application log files. The main result is the derivation of a new failure index metric for HPC: a normalized representation of parallel application volatility and/or resiliency that complements existing reliability metrics such as mean time between failures (MTBF) and aims for a better presentation of HPC application resilience. This paper provides an introduction to a Failure Index (FI) for HPC reliability and takes the reader through a use case wherein the FI is used to expose various run-time fluctuations in the failure rate of applications running on a collection of HPC platforms.
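The abstract does not say which inequality measure from economics the authors apply, nor how the FI is defined. As a rough illustration only, the sketch below uses the Gini coefficient (an assumption, chosen because it is a widely known inequality measure) to show how a normalized index can separate two failure logs that have the same MTBF. The function names `gini` and `mtbf` and the sample failure counts are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def gini(values: Sequence[float]) -> float:
    """Gini coefficient of non-negative values.

    0 means the values are perfectly even; values closer to 1 mean
    the total is concentrated in a few entries.
    NOTE: an illustrative stand-in, not the paper's Failure Index.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted data.
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1) / n


def mtbf(total_hours: float, failures: int) -> float:
    """Mean time between failures: observed hours divided by failure count."""
    return total_hours / failures if failures else float("inf")


# Hypothetical failure counts per 24-hour window for two applications.
even = [5, 5, 5, 5]      # failures spread uniformly over the windows
bursty = [0, 0, 1, 19]   # same total, concentrated in one window

hours = 24.0 * len(even)
for name, counts in (("even", even), ("bursty", bursty)):
    print(name, "MTBF (h):", mtbf(hours, sum(counts)), "Gini:", gini(counts))
```

Both hypothetical logs contain 20 failures over 96 hours, so their MTBF is identical, while the inequality measure differs sharply between them. This is the kind of run-time fluctuation that an average such as MTBF cannot show on its own.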

  • Publication date: 2016-7