A Highly Efficient GPU-CPU Hybrid Parallel Implementation of Sparse LU Factorization

Liu Li; Liu Li; Yang Guangwen

Abstract

In this paper, we aim to accelerate sparse LU factorization on CPU. We present a tiled storage format and a parallel algorithm that improve the memory access pattern, and a register blocking method that compresses the on-chip working set. The OpenMP implementation of our algorithm gives more stable performance across different matrices and outperforms SuperLU and KLU by a factor of 1.88 to 6 on an Intel 8-core CPU (central processing unit) for matrices from the Florida matrix collection. Based on this algorithm, we further propose a GPU-CPU hybrid pipelined scheme that overlaps computations on the CPU with computations on the GPU. Compared to the better of SuperLU and KLU on an Intel 8-core CPU, our algorithm achieves a 1.1- to 19.7-fold speedup on GPU for double precision. Compared to the OpenMP implementation of our algorithm on an Intel 8-core CPU, our GPU implementation achieves a 2-fold speedup in the best cases.

Publication date: 2012-1
Institution: Tsinghua University
