Auto-Tuning of Thread Assignment for Matrix-Vector Multiplication on GPUs

Authors: Wang Jinwei*; Ma Xirong; Zhu Yuanping; Sun Jizhou
Source: IEICE Transactions on Information and Systems, 2013, E96D(11): 2319-2326.
DOI: 10.1587/transinf.E96.D.2319

Abstract

Modern GPUs have evolved into more general-purpose processors capable of executing scientific and engineering computations. With their large number of computing cores, they provide a highly parallel computing environment well suited to data-parallel arithmetic computations, particularly linear algebra operations. Matrix-vector multiplication is one of the most important dense linear algebra operations; it appears in a diverse set of applications across many fields and must therefore be fully optimized to achieve high performance. In this paper, we propose a novel auto-tuning method for matrix-vector multiplication on GPUs, in which the number of threads assigned to compute one element of the result vector is auto-tuned according to the size of the matrix. On an NVIDIA GTX 650 GPU based on the Kepler architecture, we developed an auto-tuner that automatically selects the optimal number of assigned threads for the calculation. Based on the auto-tuner's result, we developed a versatile, generic matrix-vector multiplication kernel with the CUDA programming model. A series of experiments on matrices of different shapes and sizes was conducted to compare the performance of our kernel with that of the kernels from CUBLAS 5.0, MAGMA 1.3, and a warp-based method. The experimental results show that the performance of our matrix-vector multiplication kernel approaches optimal behavior as the matrix size increases and depends very little on the shape of the matrix, a significant improvement over the other three kernels, which exhibit unstable performance for different matrix shapes.
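To make the threads-per-element scheme described above concrete, the following is a minimal CUDA sketch of one way such a kernel can be parameterized; it is an illustration under stated assumptions, not the authors' actual kernel. The name matvec_kernel, the template parameter THREADS_PER_ROW, and the warp-shuffle reduction are all assumptions (the shuffle uses the modern __shfl_down_sync; Kepler-era code would use __shfl_down).

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch: THREADS_PER_ROW threads cooperate on one element
// of y = A * x, with A stored row-major. THREADS_PER_ROW is assumed to
// divide the warp size (32) and would be chosen by the auto-tuner.
template <int THREADS_PER_ROW>
__global__ void matvec_kernel(const float *A, const float *x, float *y,
                              int rows, int cols)
{
    // Split the global thread id into (row, lane) so that groups of
    // THREADS_PER_ROW consecutive threads share one row of A.
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int row  = tid / THREADS_PER_ROW;
    int lane = tid % THREADS_PER_ROW;

    // Each cooperating thread accumulates a strided partial dot product.
    float sum = 0.0f;
    if (row < rows)
        for (int c = lane; c < cols; c += THREADS_PER_ROW)
            sum += A[(size_t)row * cols + c] * x[c];

    // Tree-reduce the partial sums within each group of THREADS_PER_ROW
    // warp lanes; all threads participate, so the full mask is valid.
    for (int offset = THREADS_PER_ROW / 2; offset > 0; offset /= 2)
        sum += __shfl_down_sync(0xffffffffu, sum, offset, THREADS_PER_ROW);

    if (row < rows && lane == 0)
        y[row] = sum;
}
```

An auto-tuner in the spirit of the paper would benchmark such a kernel for candidate values of THREADS_PER_ROW (for example 1, 2, 4, 8, 16, or 32) and select the fastest configuration for the given matrix dimensions; for a tuned value T, a plausible launch would be matvec_kernel<T><<<(rows * T + 255) / 256, 256>>>(dA, dx, dy, rows, cols).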

Full Text