Measuring the Impact of Memory Errors on Application Performance

Authors: Gottscho Mark; Shoaib Mohammed; Govindan Sriram; Sharma Bikash; Wang Di; Gupta Puneet
Source: IEEE Computer Architecture Letters, 2017, 16(1): 51-55.
DOI: 10.1109/LCA.2016.2599513

Abstract

Memory reliability is a key factor in the design of warehouse-scale computers. Prior work has focused on the performance overheads of memory fault-tolerance schemes when errors do not occur at all, and when detected but uncorrectable errors occur, which result in machine downtime and loss of availability. We focus on a common third scenario, namely, situations when hard but correctable faults exist in memory; these may cause an "avalanche" of errors to occur on affected hardware. We expose how the hardware/software mechanisms for managing and reporting memory errors can cause severe performance degradation in systems suffering from hardware faults. We inject faults in DRAM on a real cloud server and quantify the single-machine performance degradation for both batch and interactive workloads. We observe that for SPEC CPU2006 benchmarks, memory errors can increase average execution time by up to 2.5x. For an interactive web-search workload, average query latency degrades by up to 2.3x for a light traffic load, and up to an extreme 3746x under peak load. Our analysis of the memory error-reporting stack reveals architecture, firmware, and software opportunities to improve performance consistency by mitigating the worst-case behavior on faulty hardware.

  • Publication date: 2017-6
  • Affiliation: Microsoft