DX100: A Programmable Data Access Accelerator for Indirection

📅 2025-05-29
🤖 AI Summary
DRAM bandwidth bottlenecks caused by indirect memory accesses—exacerbated by limited DRAM controller visibility, small request buffers, and insufficient memory-level parallelism—hinder modern multi-core systems. To address this, we propose DX100, a shared, programmable data-access accelerator. DX100 introduces the first general-purpose ISA-compatible architecture for hardware-accelerated indirect memory access, offloading batched indirect address computation and memory requests from multiple cores. It enables cross-core sharing, dynamic request reordering, interleaving, and merging—significantly improving row-buffer hit rates and DRAM bandwidth utilization. A fully automated MLIR-based compilation flow enables zero-modification porting of existing applications. Evaluated on 12 benchmarks spanning scientific computing, databases, and graph analytics, DX100 achieves a 2.6× speedup over a multi-core baseline and outperforms the state-of-the-art indirect prefetcher by 2.0×.

📝 Abstract
Indirect memory accesses frequently appear in applications where memory bandwidth is a critical bottleneck. Prior indirect memory access proposals, such as indirect prefetchers, runahead execution, fetchers, and decoupled access/execute architectures, primarily focus on improving memory access latency by loading data ahead of computation, but still rely on the DRAM controllers to reorder memory requests and enhance memory bandwidth utilization. DRAM controllers have limited visibility into future memory accesses due to the small capacity of request buffers and the restricted memory-level parallelism of conventional core and memory systems. We introduce DX100, a programmable data access accelerator for indirect memory accesses. DX100 is shared across cores to offload bulk indirect memory accesses and the associated address calculation operations. DX100 reorders, interleaves, and coalesces memory requests to improve DRAM row-buffer hit rate and memory bandwidth utilization. DX100 provides a general-purpose ISA to support diverse access types, loop patterns, conditional accesses, and address calculations. To support this accelerator without significant programming effort, we discuss a set of MLIR compiler passes that automatically transform legacy code to utilize DX100. Experimental evaluations on 12 benchmarks spanning scientific computing, database, and graph applications show that DX100 achieves performance improvements of 2.6x over a multicore baseline and 2.0x over the state-of-the-art indirect prefetcher.
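To make the abstract concrete, the sketch below illustrates the kind of access pattern DX100 targets (a gather `A[idx[i]]`) and a toy software analogue of its request reordering: sorting a batch of indices by DRAM row so accesses to an open row are issued together. This is purely illustrative; the row size and helper functions are assumptions, not DX100's actual ISA or microarchitecture.

```python
# Illustrative sketch (NOT DX100's actual ISA or hardware): the indirect
# gather pattern the paper targets, plus a toy version of batched request
# reordering by DRAM row to raise the row-buffer hit rate.

ROW_SIZE = 8  # hypothetical DRAM row granularity, in array elements


def gather(values, idx):
    """Baseline indirect gather: y[i] = values[idx[i]]."""
    return [values[j] for j in idx]


def row_buffer_hits(idx, row_size=ROW_SIZE):
    """Count accesses that land in the same row as the previous access."""
    hits = 0
    prev_row = None
    for j in idx:
        row = j // row_size
        if row == prev_row:
            hits += 1
        prev_row = row
    return hits


def reorder_by_row(idx, row_size=ROW_SIZE):
    """Toy analogue of DX100-style reordering: sort a batch of indices
    by row so requests to one open row are serviced back to back."""
    return sorted(idx, key=lambda j: j // row_size)


values = list(range(64))
idx = [3, 40, 5, 41, 2, 42, 7, 43]  # ping-pongs between rows 0 and 5

assert gather(values, idx) == [3, 40, 5, 41, 2, 42, 7, 43]
print(row_buffer_hits(idx))                  # 0 hits: every access opens a new row
print(row_buffer_hits(reorder_by_row(idx)))  # 6 hits: same-row accesses grouped
```

In hardware, this reordering happens across batched requests from multiple cores, which is what gives DX100 far more visibility than a DRAM controller's small request buffer.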
Problem

Research questions and friction points this paper is trying to address.

Indirect memory accesses bottleneck DRAM bandwidth in multi-core systems
DRAM controllers have limited visibility into future accesses due to small request buffers
Prior prefetchers and decoupled access/execute designs still rely on the controller to reorder requests
Innovation

Methods, ideas, or system contributions that make the work stand out.

Programmable accelerator for indirect memory accesses
Reorders and coalesces memory requests efficiently
MLIR compiler transforms legacy code automatically
Alireza Khadem
University of Michigan
Kamalavasan Kamalakkannan
Los Alamos National Laboratory
Zhenyan Zhu
University of Michigan
Akash Poptani
University of Michigan
Yufeng Gu
University of Michigan
Jered Dominguez-Trujillo
Los Alamos National Laboratory
Nishil Talati
Assistant Research Scientist, University of Michigan
Computer Architecture, Systems, Generative AI, Data Analytics
Daichi Fujiki
Institute of Science Tokyo
Computer Architecture
Scott A. Mahlke
University of Michigan
Galen Shipman
Los Alamos National Laboratory
Computer Science, High Performance Computing, File Systems, Programming Models
Reetuparna Das
University of Michigan
Computer Architecture