Choreographer: A Full-System Framework for Fine-Grained Tasks in Cache Hierarchies

📅 2025-10-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing evaluation frameworks do not accurately model the hardware and software overheads of latency-sensitive, fine-grained accelerators placed in the cache hierarchy. In particular, they overlook the interaction delays introduced by the AMBA CHI coherence protocol and the Linux software stack, as well as the performance variation caused by cache configuration. This paper presents Choreographer, a full-system simulation framework for evaluating such accelerators. Built on gem5, it integrates a CHI mesh network with a complete Linux software stack and offers a C++ API and modular configuration options, capturing the overheads of core-accelerator and cache-accelerator interactions. Two case studies demonstrate the framework: a data-aware prefetcher for graph analytics achieves speedups of 1.08× to 1.88× by reducing memory access latency, and a quicksort accelerator achieves more than 2× speedup with minimal address translation overhead.

📝 Abstract

In this paper, we introduce Choreographer, a simulation framework that enables a holistic system-level evaluation of fine-grained accelerators designed for latency-sensitive tasks. Unlike existing frameworks, Choreographer captures all hardware and software overheads in core-accelerator and cache-accelerator interactions, integrating a detailed gem5-based hardware stack featuring an AMBA coherent hub interface (CHI) mesh network and a complete Linux-based software stack. To facilitate rapid prototyping, it offers a C++ application programming interface and modular configuration options. Our detailed cache model provides accurate insights into performance variations caused by cache configurations, which are not captured by other frameworks. The framework is demonstrated through two case studies: a data-aware prefetcher for graph analytics workloads, and a quicksort accelerator. Our evaluation shows that the prefetcher achieves speedups between 1.08x and 1.88x by reducing memory access latency, while the quicksort accelerator delivers more than 2x speedup with minimal address translation overhead. These findings underscore the ability of Choreographer to model complex hardware-software interactions and optimize performance in small task offloading scenarios.

Problem

Research questions and friction points this paper is trying to address.

Simulating fine-grained accelerators for latency-sensitive tasks holistically

Capturing hardware-software overheads in core-accelerator and cache interactions

Modeling cache configuration impacts on performance for task offloading
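
The overhead concern listed above can be made concrete with a back-of-the-envelope model. The sketch below is illustrative only: the `OffloadCost` struct, its fields, and the helper functions are hypothetical names chosen for this example and are not part of the Choreographer C++ API. It assumes the time to offload one small task is simply the sum of invocation, address translation, coherence, and compute costs.

```cpp
#include <cassert>

// Hypothetical per-task cost breakdown in nanoseconds.
// Illustrative only; not part of the Choreographer API.
struct OffloadCost {
    double invoke_ns;       // software-stack cost of launching the task
    double translation_ns;  // address translation cost
    double coherence_ns;    // coherence protocol traffic cost
    double compute_ns;      // time the accelerator spends on the task itself
};

// Total time observed by the core for one offloaded task,
// assuming the four components do not overlap.
inline double offload_total_ns(const OffloadCost& c) {
    return c.invoke_ns + c.translation_ns + c.coherence_ns + c.compute_ns;
}

// Speedup relative to running the same task on the core in cpu_ns.
inline double effective_speedup(double cpu_ns, const OffloadCost& c) {
    const double total = offload_total_ns(c);
    assert(total > 0.0);  // guard against division by zero
    return cpu_ns / total;
}

// Offloading pays off only when the speedup exceeds 1.
inline bool worth_offloading(double cpu_ns, const OffloadCost& c) {
    return effective_speedup(cpu_ns, c) > 1.0;
}
```

With made-up costs of 100 ns invocation, 50 ns translation, 50 ns coherence, and 200 ns compute, a task that takes 800 ns on the core sees a 2x speedup, while a 300 ns task becomes slower when offloaded. The smaller the task, the more the fixed overheads dominate, which is why a simulator for fine-grained offloading needs to account for each of them.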

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates a gem5-based hardware stack, including an AMBA CHI mesh network, with a complete Linux software stack

Provides C++ API and modular options for rapid prototyping

Detailed cache model exposes performance variations caused by cache configuration

🔎 Similar Papers

Virtuoso: Enabling Fast and Accurate Virtual Memory Research via an Imitation-based Operating System Simulation Methodology