Abstract

Traditionally, the multiple CPUs mounted on one node of a high-performance cluster are based on the symmetric multiprocessing (SMP) architecture, in which memory bandwidth is a major bottleneck for high-performance computing. Recently, Intel and AMD introduced the non-uniform memory access (NUMA) architecture for multi-CPU servers, an important extension of the SMP computer. In a NUMA server, each CPU has its own local memory and can also access the memory attached to the other CPUs through the on-board network. For a parallel code, the data for each CPU can therefore be allocated in that CPU's local memory to accelerate memory access. In this paper, we investigate how to achieve high performance for a parallel finite-difference time-domain (FDTD) code on a computer cluster of 21 nodes with 42 CPUs and 168 cores. Numerical experiments demonstrate that the choice of job binding scheme can significantly affect the performance of the parallel FDTD code.