Abstract

Wide Single Instruction Multiple Data (SIMD) architectures are important for compute-intensive applications, but they are less efficient for applications containing loops with cross-iteration dependencies, which are difficult to parallelize and vectorize. This paper introduces Decoupled Iteration Mapping (DIM), a technique that dynamically maps a loop with cross-iteration dependencies onto an improved SIMD architecture to achieve multicore-like thread-parallel performance. The modification to the baseline architecture is minor and consists of a Prefetch Unit & Instruction Buffer Array (PU&IBA), a Loop Control Unit & Instruction Dispatch Unit (LCU&IDU), and a Data Buffer Chain (DBC). Experimental results show that the proposed DIM scheme achieves an average performance speedup of 3.04x at a cost of only 6.44% area overhead.

Full text