Developing a scalable hybrid MPI/OpenMP unstructured finite element model

Authors: Guo Xiaohu*; Lange Michael; Gorman Gerard; Mitchell Lawrence; Weiland Michele
Source: Computers & Fluids, 2015, 110: 227-234.
DOI:10.1016/j.compfluid.2014.09.007

Abstract

The trend of all modern computer architectures, and the path to exascale, is towards increasing numbers of lower-power cores, with a decreasing memory-to-core ratio. This imposes a strong evolutionary pressure on algorithms and software to efficiently utilise all levels of parallelism available on a given platform while minimising data movement. Unstructured finite element codes have long been effectively parallelised using domain decomposition methods, implemented using libraries such as the Message Passing Interface (MPI). However, there are many optimisation opportunities when threading is used for intra-node parallelisation on the latest multi-core/many-core platforms. The benefits include increased algorithmic freedom, reduced memory requirements, cache sharing, a reduced number of partitions, and less MPI communication and I/O overhead. In this paper, we report progress in implementing a hybrid OpenMP/MPI version of the unstructured finite element code Fluidity. For matrix assembly kernels, the OpenMP parallel algorithm uses graph colouring to identify independent sets of elements that can be assembled concurrently with no race conditions. In this phase there are no MPI overheads, as each MPI process only assembles its own local part of the global matrix. We use an OpenMP threaded fork of PETSc to solve the resulting sparse linear systems of equations. We experiment with a range of preconditioners, including HYPRE, which provides the algebraic multigrid preconditioner BoomerAMG, where the smoother is also threaded. Since unstructured finite element codes are well known to be memory latency bound, particular attention is paid to ccNUMA architectures, where data locality is particularly important to achieve good intra-node scaling characteristics. We also demonstrate that utilising non-blocking algorithms and libraries is critical for mixed-mode applications to achieve better parallel performance than the pure MPI version.
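
The colouring-based assembly described in the abstract can be illustrated with a small sketch. This is not the Fluidity implementation (which is Fortran with OpenMP); it is a toy Python version under stated assumptions: elements are tuples of node indices, two elements conflict when they share a node, the colouring is a simple greedy one, and the "matrix" is reduced to a diagonal array with a placeholder contribution of 1.0 per element node. The names `colour_elements` and `assemble` are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def colour_elements(elements):
    """Greedy colouring: elements that share a node get different colours."""
    node_to_elems = defaultdict(list)
    for e, nodes in enumerate(elements):
        for n in nodes:
            node_to_elems[n].append(e)

    colours = [-1] * len(elements)
    for e, nodes in enumerate(elements):
        # Colours already taken by neighbouring (node-sharing) elements.
        used = {colours[nb] for n in nodes for nb in node_to_elems[n]
                if colours[nb] >= 0}
        c = 0
        while c in used:
            c += 1
        colours[e] = c
    return colours


def assemble(elements, n_nodes, workers=4):
    """Assemble colour by colour; each colour is an independent set."""
    colours = colour_elements(elements)
    groups = defaultdict(list)
    for e, c in enumerate(colours):
        groups[c].append(e)

    diag = [0.0] * n_nodes  # toy stand-in for the global matrix

    def add_element(e):
        # Placeholder local contribution (assumption, not a real kernel).
        for n in elements[e]:
            diag[n] += 1.0

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for c in sorted(groups):
            # Elements of one colour touch disjoint nodes, so no locking is
            # needed; list() waits for completion, acting as a barrier
            # before the next colour starts.
            list(pool.map(add_element, groups[c]))
    return colours, diag


elements = [(0, 1, 2), (1, 2, 3), (2, 3, 4), (4, 5, 6)]
colours, diag = assemble(elements, n_nodes=7)
```

The design point is that synchronisation happens only between colours, not per matrix entry: within a colour, every thread writes to a disjoint set of rows, which is why the abstract can claim concurrent assembly with no race conditions.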

  • Publication date: 2015-3-30