🤖 AI Summary
Existing evaluation frameworks do not accurately model the hardware and software overheads of latency-sensitive, fine-grained accelerators placed within the cache hierarchy. In particular, they overlook the interaction delays introduced by the AMBA CHI coherence protocol and the Linux software stack, as well as the performance variability caused by cache configuration. This paper presents Choreographer, a full-system simulation framework for evaluating fine-grained tasks at the cache level. Built on gem5, it integrates a CHI mesh network with a complete Linux software stack and offers a C++ API and modular configuration options, enabling end-to-end modeling across CPU cores, caches, and accelerators. It quantifies address translation, memory access, and protocol overheads at fine granularity. In two case studies, a data-aware prefetcher for graph analytics achieves speedups of 1.08×–1.88× and a quicksort accelerator achieves more than 2×, demonstrating the framework's usefulness in guiding low-latency accelerator design.
📝 Abstract
In this paper, we introduce Choreographer, a simulation framework that enables holistic system-level evaluation of fine-grained accelerators designed for latency-sensitive tasks. Unlike existing frameworks, Choreographer captures all hardware and software overheads in core-accelerator and cache-accelerator interactions by integrating a detailed gem5-based hardware stack, featuring an AMBA Coherent Hub Interface (CHI) mesh network, with a complete Linux-based software stack. To facilitate rapid prototyping, it offers a C++ application programming interface and modular configuration options. Its detailed cache model provides accurate insight into performance variations caused by cache configurations, which other frameworks do not capture. We demonstrate the framework through two case studies: a data-aware prefetcher for graph analytics workloads and a quicksort accelerator. Our evaluation shows that the prefetcher achieves speedups between 1.08x and 1.88x by reducing memory access latency, while the quicksort accelerator delivers more than 2x speedup with minimal address translation overhead. These findings underscore the ability of Choreographer to model complex hardware-software interactions and to guide performance optimization in small-task offloading scenarios.