Abstract

The evolution of hardware architectures, driven by growing demands for performance and energy efficiency, has led to complex HPC systems. In the context of Finite Element Methods, exposing massive parallelism in unstructured mesh computations with efficient load balancing and minimal synchronization is challenging: several parallelization strategies have to be combined to exploit the multiple levels of parallelism. We propose several contributions aimed at handling irregular codes and data structures efficiently. We have developed a hybrid parallelization approach based on the Divide & Conquer (D&C) principle, which combines distributed, shared, and vector parallelism in a fine-grained task-based approach applied to irregular structures. We evaluate our approach on a matrix assembly step of an industrial application from Dassault Aviation, on standard Xeon multicores and Xeon Phi KNC manycores. On 512 Intel Xeon E5-2670 Sandy Bridge cores, we outperform the pure MPI approach by up to 3.47x and reach 77% parallel efficiency using only 2000 vertices per core. On 4 Xeon Phi 5110P KNC coprocessors, D&C performs similarly to 96 Intel Xeon E5-2670 Sandy Bridge cores; it achieves a parallel efficiency of 96% and up to a 6.56x speedup compared to pure MPI.

  • Publication date: 2018-4
