An investigation of the effects of hard and soft errors on graphics processing unit-accelerated molecular dynamics simulations

Authors: Betz Robin M; DeBardeleben Nathan A; Walker Ross C*
Source: Concurrency and Computation: Practice and Experience (CCPE), 2014, 26(13): 2134-2140.
DOI:10.1002/cpe.3232

Abstract

Molecular dynamics (MD) simulations rely on the accurate evaluation and integration of Newton's equations of motion to propagate the positions of atoms in proteins during a simulation. As such, one can expect them to be sensitive to any form of numerical error that may occur during a simulation. Increasingly graphics processing units (GPUs) are being used to accelerate MD simulations. Current GPU architectures designed for high performance computing applications support error-correcting codes (ECC) that detect and correct single bit-flip soft error events in GPU memory; however, this error checking carries a penalty in terms of simulation speed. ECC is also a major distinguishing feature between high performance computing NVIDIA Tesla cards and the considerably more cost-effective NVIDIA GeForce gaming cards. An argument often put forward for not using GeForce cards is that the results are unreliable because of the lack of ECC. In an initial attempt to quantify these concerns, an investigation of the reproducibility of GPU-accelerated MD simulations using the AMBER software was conducted on the XSEDE supercomputer Keeneland, a cluster at Los Alamos National Laboratory, and a cluster at the San Diego Supercomputer Center. While the data collected are insufficient to make solid conclusions and more extensive testing is needed to provide quantitative statistics, the absence of ECC events and lack of any silent errors in all the simulations conducted to date suggest that these errors are exceedingly rare and as such the time and memory penalty of ECC may outweigh the utility of error checking functionality. However, a considerable amount of error originating from defective hardware was observed, which suggests that rigorous acceptance testing should be performed on new GPU-based systems by repeatedly running reproducible yet realistic calculations.

  • Publication date: 2014-09-10
  • Affiliation: Los Alamos