Asteroid: Scalable Online Memory Diagnostics for Multi-core, Multi-socket Servers

Authors: Rahman Musfiq; Childers Bruce R*
Source: International Journal of Parallel Programming, 2016, 44(5): 949-974.
DOI: 10.1007/s10766-016-0400-2

Abstract

Memory diagnostics are important to improving the resilience of DRAM main memory. As bit cell size reaches physical limits, DRAM memory will be more likely to suffer both transient and permanent errors. Memory diagnostics that operate online can be a component of a comprehensive strategy to allay errors. This paper presents a novel approach, Asteroid, to integrate online memory diagnostics during workload execution. The approach supports diagnostics that adapt at runtime to workload behavior and resource availability to maximize test quality while reducing performance overhead. We describe Asteroid's design and how it can be efficiently integrated with a hierarchical memory allocator in modern operating systems. We also present how the framework enables control policies to dynamically configure a diagnostic. Using an adaptive policy, in a 16-core server, Asteroid has modest overhead of 1-4% for workloads with low to high memory demand. For these workloads, Asteroid's adaptive policy has good error coverage and can thoroughly test memory.

Full text