GainSight: Application-Guided Profiling for Composing Heterogeneous On-Chip Memories in AI Hardware Accelerators

📅 2025-04-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional SRAM-based on-chip memory in AI accelerators faces a fundamental density–efficiency trade-off: it cannot simultaneously achieve high density and low power consumption, and its architecture cannot adapt to dynamic application characteristics. Method: This paper proposes an application-behavior-driven approach to composing heterogeneous on-chip memories, integrating fine-grained dynamic application profiling—enabled by lightweight instrumentation and retargetable simulation—into on-chip memory architecture decisions. The framework supports cross-architecture (GPU and systolic-array) memory access pattern analysis and data lifetime modeling. Contribution/Results: Profiling reveals that short-lived data dominates (79% of systolic-array scratchpad accesses and 40% of GPU L1 accesses involve short-lived data), yielding quantitative guidelines for system-level Si-GCRAM deployment. Up to 90% of GPU cache fetches are never reused, exposing severe cache pollution. Replacing SRAM with Si-GCRAM where suitable reduces active energy by 11–28%.

📝 Abstract
As AI workloads drive soaring memory requirements, there is a need for higher-density on-chip memory for domain-specific accelerators that goes beyond what current SRAM technology can provide. We motivate that algorithms and application behavior should guide the composition of heterogeneous on-chip memories. However, there has been little work in factoring dynamic application profiles into such design decisions. We present GainSight, a profiling framework that analyzes fine-grained memory access patterns and computes data lifetimes in domain-specific accelerators. By combining instrumentation and simulation across retargetable hardware backends, GainSight aligns heterogeneous memory designs with workload-specific traffic and lifetime metrics. Case studies on MLPerf Inference and PolyBench workloads using NVIDIA H100 GPUs and systolic arrays reveal key insights: (1) 40% of L1 and 18% of L2 GPU cache accesses, and 79% of systolic array scratchpad accesses across profiled workloads are short-lived and suitable for silicon-based gain cell RAM (Si-GCRAM); (2) Si-GCRAM reduces active energy by 11-28% compared to SRAM; (3) Up to 90% of GPU cache fetches are never reused, highlighting inefficiencies in terms of cache pollution. These insights that GainSight provides can be used to better understand the design spaces of both emerging on-chip memories and software algorithmic optimizations for the next generation of AI accelerators.
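The abstract's central metric is data lifetime: how long a value written to on-chip memory stays live before its last use, which determines whether it can tolerate the limited retention of gain cell RAM. GainSight's actual trace format and lifetime definitions are not given here, so the sketch below is a minimal illustration under assumed conventions: a trace of `(cycle, address, op)` events, where a value's lifetime runs from the write that creates it to its last read before it is overwritten (the function names `data_lifetimes` and `short_lived_fraction` are hypothetical, not the framework's API).

```python
def data_lifetimes(trace):
    """Compute per-value lifetimes from a memory access trace.

    trace: iterable of (cycle, address, op) tuples, op in {"write", "read"},
    ordered by cycle. A value's lifetime spans from the write that creates
    it to its last read before the address is overwritten.
    """
    last_write = {}   # address -> cycle of the live value's creating write
    last_read = {}    # address -> cycle of the most recent read of that value
    lifetimes = []    # completed lifetimes, in cycles

    for cycle, addr, op in trace:
        if op == "write":
            if addr in last_write:
                # the previous value at this address dies here
                end = last_read.get(addr, last_write[addr])
                lifetimes.append(end - last_write[addr])
                last_read.pop(addr, None)
            last_write[addr] = cycle
        else:  # read
            last_read[addr] = cycle

    # close out values still live at the end of the trace
    for addr, start in last_write.items():
        lifetimes.append(last_read.get(addr, start) - start)
    return lifetimes


def short_lived_fraction(trace, threshold):
    """Fraction of values whose lifetime falls below a retention threshold,
    i.e. candidates for a retention-limited memory such as Si-GCRAM."""
    lifetimes = data_lifetimes(trace)
    return sum(lt < threshold for lt in lifetimes) / len(lifetimes)
```

A value that is written and fully consumed within a few cycles (common in systolic-array scratchpads) yields a small lifetime and counts toward the short-lived fraction; the threshold would in practice come from the gain cell's retention time at the target clock frequency.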
Problem

Research questions and friction points this paper is trying to address.

Optimizing on-chip memory composition for AI accelerators using application behavior
Profiling memory access patterns to guide heterogeneous memory design
Reducing energy consumption by replacing SRAM with Si-GCRAM where suitable
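The energy argument above is first-order arithmetic: if a fraction of accesses targets short-lived data that can move from SRAM to cheaper Si-GCRAM, the overall saving is a weighted average of the two per-access energies. The per-access energy ratio below (0.65) is an illustrative assumption, not a number from the paper; the paper's reported 11–28% range comes from its own device and workload models.

```python
def hybrid_energy_saving(short_lived_frac, gcram_energy_ratio):
    """Relative active-energy saving when short-lived accesses are served
    by Si-GCRAM instead of SRAM.

    short_lived_frac: fraction of accesses touching short-lived data.
    gcram_energy_ratio: Si-GCRAM per-access energy relative to SRAM
    (assumed for illustration; real values come from circuit models).
    """
    # hybrid energy, normalized so that all-SRAM = 1.0
    hybrid = short_lived_frac * gcram_energy_ratio + (1 - short_lived_frac)
    return 1.0 - hybrid

# e.g. with 79% short-lived accesses (the systolic-array figure) and an
# assumed Si-GCRAM access costing 65% of an SRAM access:
# hybrid_energy_saving(0.79, 0.65) ≈ 0.28
```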
Innovation

Methods, ideas, or system contributions that make the work stand out.

Profiles fine-grained memory access patterns
Combines instrumentation and simulation techniques
Aligns memory designs with workload-specific metrics
Authors
Peijing Li (Stanford University)
Matthew Hung (Stanford University)
Yiming Tan (Stanford University)
Konstantin Hossfeld (PhD Student, Stanford University; Computer Architecture, Compilers)
Jake Jiajun Cheng (Stanford University)
Shuhan Liu (Stanford University)
Lixian Yan (Stanford University)
Xinxin Wang (Stanford University)
H.-S. Philip Wong (Professor of Electrical Engineering, Stanford University; electron devices, VLSI, solid-state, nanotechnology)
Thierry Tambe (Assistant Professor of Electrical Engineering, Stanford University; Computer Architecture, VLSI)