A general and fast distributed system for large-scale dynamic programming applications

作者:Wang, Chen; Yu, Ce*; Tang, Shanjiang; Xiao, Jian; Sun, Jizhou; Meng, Xiangfei
来源:Parallel Computing, 2016, 60: 1-21.
DOI:10.1016/j.parco.2016.09.002

摘要

Dynamic programming is an important technique widely used in many scientific applications. Due to the massive volume of applications' data in practice, parallel and distributed DP is a must. However, writing a parallel and distributed DP program is difficult and error prone because of its intrinsically strong data dependency. In this paper, we present DPX10, a DAG-based distributed X10 framework aiming at simplifying the parallel programming for DP applications. DPX10 enables users to write highly efficient parallel DP programs without much effort. For DPX10 programming, users only need to do two things: 1) Instantiating a DAG pattern by indicating the dependency between vertices of the DAG; 2) Implementing the DP application's logic in the compute method of the vertices. DPX10 provides eight commonly used DAG patterns and a simple API to allow users to customize their own DAG patterns. All the tiresome work of DP parallelization including DAG distribution, tasks scheduling, and tasks communication are hidden from users and covered by DPX10. Moreover, DPX10 is fault-tolerant and has a mechanism to handle the problem of straggler tasks, which run much slower than other tasks due to unexpected resource contention. Finally, we use four DP applications with up to 2 billion vertices running on 240 cores to demonstrate the simplicity, efficiency, and scalability of our proposed framework.