An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads

作者:Nachiket Kapre; Andre Dehon
来源:International Journal of Reconfigurable Computing, 2011.
DOI:10.1155/2011/745147

摘要

Parallel graph-oriented applications expressed in the Bulk-Synchronous Parallel (BSP) and Token Dataflow compute models generate highly-structured communication workloads from messages propagating along graph edges. We can statially expose this structure to traffic compilers and optimization tools to reshape and reduce traffic for higher performance (or lower area, lower energy, lower cost). Such offline traffic optimization eliminates the need for complex, runtime NoC hardware and enables lightweight, scalable NoCs. We perform load balancing, placement, fanout routing, and fine-grained synchronization to optimize our workloads for large networks up to 2025 parallel elements for BSP model and 25 parallel elements for Token Dataflow. This allows us to demonstrate speedups between 1.2℅ and 22℅ (3.5℅ mean), area reductions (number of Processing Elements) between 3℅ and 15℅ (9℅ mean) and dynamic energy savings between 2℅ and 3.5℅ (2.7℅ mean) over a range of real-world graph applications in the BSP compute model. We deliver speedups of 0.5每13℅ (geomean 3.6℅) for Sparse Direct Matrix Solve (Token Dataflow compute model) applied to a range of sparse matrices when using a high-quality placement algorithm. We expect such traffic optimization tools and techniques to become an essential part of the NoC application-mapping flow. 1. Introduction Real-world communication workloads exhibit structure in the form of locality, sparsity, fanout distribution, and other properties. If this structure can be exposed to automation tools, we can reshape and optimize the workload to improve performance, lower area, and reduce energy. In this paper, we develop a traffic compiler that exploits structural properties of Bulk-Synchronous Parallel communication workloads. This compiler provides insight into performance tuning of communication-intensive parallel applications. The performance and energy improvements made possible by the compiler allow us to build the NoC from simple hardware elements that consume less area and eliminate the need for using complex, area-hungry, adaptive hardware. We now introduce key structural properties exploited by our traffic compiler. (i)When the natural communicating components of the traffic do not match the granularity of the NoC architecture, applications may end up being poorly load balanced. We discuss Decomposition and Clustering as techniques to improve load balance. (ii) Most applications exhibit sparsity and locality; an object often interacts regularly with only a few other objects in its neighborhood. We exploit these properties by

  • 出版日期2011

全文