Abstract

Accelerator-based chip multiprocessors (ACMPs), which combine application-specific hardware accelerators (ACCs) with host processor core(s), are promising architectures for high-performance, power-efficient computing. However, ACMPs with many ACCs face scalability limitations: the ACCs' performance benefits can be overshadowed by bottlenecks on shared resources, namely the processor core(s), the communication fabric/DMA, and on-chip memory. These bottlenecks stem primarily from the ACCs' data accesses and their dependence on the host for orchestration. Because ACC communication semantics are only loosely defined and are realized on general-purpose architectures, contention on these shared resources hampers performance. This paper explores and alleviates the scalability limitations of ACMPs. To this end, it first proposes ACMPerf, an analytical model that captures the impact of shared-resource bottlenecks on the achievable ACC benefits. It then identifies and formalizes ACC communication semantics, paving the path toward a more scalable integration of ACCs. The semantics describe four primary aspects: 1) data access; 2) data granularity; 3) data marshalling; and 4) synchronization. Finally, this paper proposes a novel architecture of transparent self-synchronizing accelerators (TSS). TSS efficiently realizes the identified communication semantics for direct ACC-to-ACC connections, which often occur in streaming applications, and thus delivers more of the ACCs' benefits than conventional ACMP architectures. Given the same set of ACCs, TSS achieves up to 130x higher throughput and 78x lower energy consumption, mainly by reducing the load on shared architectural resources by 78.3x.