MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs

📅 2026-04-17
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the large gap in memory demands between the prefill and decode phases of agentic large language model (LLM) inference, a gap that existing heterogeneous NPU systems handle poorly because the interplay between workload characteristics, NPU design, and memory system design remains largely underexplored. The authors propose MemExplorer, a memory system synthesizer that provides a unified abstraction over diverse on-chip and off-chip memory technologies, including SRAM, HBM, LPDDR, GDDR, and high-bandwidth flash (HBF), and that selects the memory configuration together with NPU design parameters such as matrix engine size to balance throughput and power between prefill and decode devices. Under the same power budget in the prefill-only setting, MemExplorer achieves up to 2.3× and 3.23× higher energy efficiency than the baseline NPU and H100, respectively; under equivalent performance targets in the decode setting, it delivers up to 1.93× and 2.72× higher power efficiency.

📝 Abstract
Emerging agentic LLM workloads are driving rapidly growing demand on both memory capacity and bandwidth, with different phases of inference (e.g., prefill and decode) imposing distinct requirements. Industry is responding by composing heterogeneous accelerators into single interconnected systems, as exemplified by NVIDIA's Vera Rubin platform, where each device brings its own memory architecture. This heterogeneity is further compounded by a widening landscape of available memory technologies: high-density on-chip SRAM, HBM, LPDDR, GDDR, and emerging options such as high-bandwidth flash (HBF), each offering different capacity, bandwidth, and power trade-offs. Identifying the right memory architecture for next-generation inference accelerators requires navigating a vast and rapidly evolving design space, in which the interplay between workload characteristics, NPU design dimensions, and memory system design remains largely underexplored. To address this challenge, we present MemExplorer, a new memory system synthesizer for heterogeneous NPU systems. MemExplorer provides a unified abstraction for modeling diverse memory technologies across different hierarchy levels (e.g., on-chip and off-chip) and automatically determines an efficient heterogeneous memory system together with NPU design choices (e.g., matrix engine size) to balance throughput and power between prefilling and decoding devices in a multi-device NPU system. Experimental results show that, under the same power budget for agentic workloads, MemExplorer achieves up to 2.3x higher energy efficiency than the baseline NPU and 3.23x higher than H100 in the prefill-only setting. Under equivalent performance targets in the decode setting, it further delivers up to 1.93x and 2.72x higher power efficiency over the baseline NPU and H100, respectively.
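To make the idea of a unified memory abstraction and joint memory/NPU exploration concrete, here is a minimal Python sketch. It is not MemExplorer's actual model or algorithm, which the abstract does not detail. It assumes a simple roofline-style throughput estimate and a Pareto filter over (throughput, power). Every name (`MemoryTech`, `Design`, `evaluate`, `pareto_front`, `explore`) is hypothetical, and every number in the catalog is an arbitrary placeholder chosen only so the script runs, not a datasheet or paper value.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class MemoryTech:
    """Hypothetical unified description of one memory technology."""
    name: str
    capacity_gb: float
    bandwidth_gbs: float  # GB/s
    power_w: float


@dataclass(frozen=True)
class Design:
    memory: MemoryTech
    engine_dim: int  # matrix engine modeled as engine_dim x engine_dim MACs


def evaluate(design, flops_per_byte, clock_ghz=1.0, watts_per_mac=1e-3):
    """Roofline-style estimate (an assumption, not the paper's model).

    Returns (throughput in GFLOP/s, power in W).
    """
    peak_compute = 2 * design.engine_dim ** 2 * clock_ghz  # 2 FLOPs per MAC
    memory_bound = design.memory.bandwidth_gbs * flops_per_byte
    throughput = min(peak_compute, memory_bound)
    power = design.memory.power_w + watts_per_mac * design.engine_dim ** 2
    return throughput, power


def pareto_front(points):
    """Keep (label, throughput, power) points that no other point dominates.

    A point is dominated if another has throughput >= and power <=,
    with at least one of the two strictly better.
    """
    front = []
    for i, (label, t, p) in enumerate(points):
        dominated = any(
            t2 >= t and p2 <= p and (t2 > t or p2 < p)
            for j, (_, t2, p2) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((label, t, p))
    return front


def explore(catalog, engine_dims, flops_per_byte):
    """Enumerate memory x engine-size combinations and return the Pareto set."""
    points = []
    for mem, dim in product(catalog, engine_dims):
        design = Design(mem, dim)
        t, p = evaluate(design, flops_per_byte)
        points.append((design, t, p))
    return pareto_front(points)


if __name__ == "__main__":
    # Placeholder values only: NOT real device parameters.
    catalog = [
        MemoryTech("HBM", capacity_gb=96, bandwidth_gbs=3000, power_w=60),
        MemoryTech("GDDR", capacity_gb=48, bandwidth_gbs=1000, power_w=35),
        MemoryTech("LPDDR", capacity_gb=128, bandwidth_gbs=500, power_w=15),
        MemoryTech("HBF", capacity_gb=512, bandwidth_gbs=1500, power_w=40),
    ]
    # Prefill is typically compute-heavy (high FLOPs per byte moved);
    # decode is typically memory-bound (low FLOPs per byte moved).
    for phase, intensity in [("prefill", 200.0), ("decode", 2.0)]:
        print(phase)
        for design, t, p in explore(catalog, [64, 128, 256], intensity):
            print(f"  {design.memory.name:6s} dim={design.engine_dim:3d} "
                  f"{t:10.1f} GFLOP/s {p:7.2f} W")
```

Running the exploration separately for a high and a low arithmetic intensity mirrors the abstract's point that prefill and decode devices may end up with different memory and matrix-engine choices. A real synthesizer would also need capacity constraints, multi-level hierarchies, and a calibrated power model, all of which this sketch omits.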
Problem

Research questions and friction points this paper is trying to address.

heterogeneous memory
agentic inference
NPU design
memory architecture
LLM workloads
Innovation

Methods, ideas, or system contributions that make the work stand out.

heterogeneous memory
NPU design space exploration
agentic LLM inference
memory system synthesis
energy efficiency optimization
Haoran Wu
University of Cambridge

Zeyu Cao
University of Cambridge

Yao Lai
HKU | UT Austin

Binglei Lou
Imperial College London

Jiayi Nie
University of Cambridge

Can Xiao
Imperial College London

Timi Adeniran
University of Cambridge

Przemyslaw Forys
Imperial College London

Kauser Johar
Chipletti

Catriona Wright
Chipletti

Junyi Liu
Microsoft Research
Hardware acceleration, Distributed Systems, High-level synthesis, FPGA

Kai Shi
Microsoft
Fiber Optics, Semiconductor Lasers, Optical Communication Systems

Nicholas D. Lane
University of Cambridge

Rika Antonova
University of Cambridge

Jianyi Cheng
University of Edinburgh
high-level synthesis, computer architecture, formal methods, machine learning, hardware security

Timothy Jones
University of Cambridge

Aaron Zhao
Imperial College London

Robert Mullins
Department of Computer Science and Technology, University of Cambridge
Computer Science - Computer Architecture - On-Chip Interconnection Networks