A High Performance and Memory Efficient LU Decomposer on FPGAs

Authors: Wu, Guiming*; Dou, Yong; Sun, Junqing; Peterson, Gregory D
Source: IEEE Transactions on Computers, 2012, 61(3): 366-378.
DOI: 10.1109/TC.2010.278

Abstract

LU decomposition for dense matrices is an important linear algebra kernel that is widely used in both scientific and engineering applications. To efficiently perform large matrix LU decomposition on FPGAs with limited local memory, a block LU decomposition algorithm on FPGAs applicable to arbitrary matrix size is proposed. Our algorithm applies a series of transformations, including loop blocking and space-time mapping, onto sequential nonblocking LU decomposition. We also introduce a high performance and memory efficient hardware architecture, which mainly consists of a linear array of processing elements (PEs), to implement our block LU decomposition algorithm. Our design can achieve optimum performance under various hardware resource constraints. Furthermore, our algorithm and design can be easily extended to the multi-FPGA platform by using a block-cyclic data distribution and inter-FPGA communication scheme. A total of 36 PEs can be integrated into a Xilinx Virtex-5 XC5VLX330 FPGA on our self-designed PCI-Express card, reaching a sustained performance of 8.50 GFLOPS at 133 MHz for a matrix size of 16,384, which outperforms several general-purpose processors. For a Xilinx Virtex-6 XC6VLX760, a newer FPGA, we predict that a total of 180 PEs can be integrated, reaching 70.66 GFLOPS at 200 MHz. Compared to the previous work, our design can integrate twice the number of PEs into the same FPGA and has significantly higher performance.
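
As a rough software illustration of the loop-blocking idea mentioned in the abstract, the sketch below shows a textbook right-looking blocked LU decomposition in plain Python. It is not the authors' hardware algorithm: it assumes no pivoting, it ignores the space-time mapping onto a PE array, and the names `block_lu`, `reconstruct`, and `block_cyclic_owner` are hypothetical. The last helper shows one simple way a one-dimensional block-cyclic distribution could assign blocks to devices; the scheme actually used in the paper may differ.

```python
def block_lu(a, b):
    """In-place blocked LU without pivoting (assumption: pivots are nonzero).

    a: square matrix as a list of lists of floats; b: block size.
    On return, the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle, including the diagonal, holds U.
    """
    n = len(a)
    for k0 in range(0, n, b):
        k1 = min(k0 + b, n)  # the last block may be smaller than b
        # Step 1: unblocked LU on the column panel (rows k0..n, cols k0..k1).
        for k in range(k0, k1):
            pivot = a[k][k]
            for i in range(k + 1, n):
                a[i][k] /= pivot
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # Step 2: row panel of U, forward substitution with the unit-lower L11.
        for j in range(k1, n):
            for k in range(k0, k1):
                for i in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # Step 3: trailing submatrix update, A22 -= L21 * U12.
        for i in range(k1, n):
            for k in range(k0, k1):
                lik = a[i][k]
                for j in range(k1, n):
                    a[i][j] -= lik * a[k][j]
    return a


def reconstruct(lu):
    """Multiply the packed factors back together: returns L * U."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]  # unit diagonal of L
                s += l_ik * lu[k][j]
            out[i][j] = s
    return out


def block_cyclic_owner(block_index, num_devices):
    """Hypothetical 1D block-cyclic mapping: block i lives on device i mod P."""
    return block_index % num_devices


if __name__ == "__main__":
    # Diagonally dominant test matrix, so skipping pivoting is safe here.
    A = [[10.0, 2.0, 3.0, 1.0],
         [2.0, 12.0, 1.0, 4.0],
         [3.0, 1.0, 11.0, 2.0],
         [1.0, 4.0, 2.0, 9.0]]
    LU = block_lu([row[:] for row in A], 2)
    err = max(abs(x - y) for ra, rb in zip(A, reconstruct(LU)) for x, y in zip(ra, rb))
    print("max reconstruction error:", err)
```

The three steps per block (panel factorization, row-panel solve, trailing update) are what make the working set per step small, which is the property a memory-limited device would rely on. The block size `b` is a free parameter in this sketch.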