🤖 AI Summary
This work addresses the large disparity in memory demands between the prefill and decode phases of large language model (LLM) inference for agentic workloads, a challenge that existing heterogeneous NPU systems handle poorly because they lack an efficient cooperative memory architecture. To bridge this gap, the authors propose MemExplorer, a memory system synthesizer that provides a unified abstraction over diverse on-chip and off-chip memory technologies, including SRAM, HBM, LPDDR, GDDR, and high-bandwidth flash (HBF), and jointly optimizes the memory configuration and NPU design choices such as matrix engine dimensions. A multi-objective optimization algorithm balances throughput and power consumption. Experimental results show that, under the same power budget, MemExplorer achieves up to 2.3× and 3.23× higher energy efficiency than the baseline NPU and H100, respectively, in the prefill-only setting; in the decode setting, at equivalent performance targets, it delivers up to 1.93× and 2.72× higher power efficiency.
📝 Abstract
Emerging agentic LLM workloads are driving rapidly growing demand for both memory capacity and bandwidth, with different phases of inference (e.g., prefill and decode) imposing distinct requirements. Industry is responding by composing heterogeneous accelerators into single interconnected systems, as exemplified by NVIDIA's Vera Rubin platform, where each device brings its own memory architecture.
This heterogeneity is further compounded by a widening landscape of available memory technologies: high-density on-chip SRAM, HBM, LPDDR, GDDR, and emerging options such as high-bandwidth flash (HBF), each offering different capacity, bandwidth, and power trade-offs.
Identifying the right memory architecture for next-generation inference accelerators requires navigating a vast and rapidly evolving design space, in which the interplay between workload characteristics, NPU design dimensions, and memory system design remains largely underexplored.
To address this challenge, we present MemExplorer, a new memory system synthesizer for heterogeneous NPU systems. MemExplorer provides a unified abstraction for modeling diverse memory technologies across different hierarchy levels (e.g., on-chip and off-chip) and automatically determines an efficient heterogeneous memory system together with NPU design choices (e.g., matrix engine size) to balance throughput and power between prefilling and decoding devices in a multi-device NPU system.
Experimental results show that, under the same power budget for agentic workloads, MemExplorer achieves up to 2.3x higher energy efficiency than the baseline NPU and 3.23x higher than H100 in the prefill-only setting. Under equivalent performance targets in the decode setting, it further delivers up to 1.93x and 2.72x higher power efficiency over the baseline NPU and H100, respectively.
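To make the idea of balancing throughput against power concrete, the following minimal sketch filters a handful of candidate designs down to their Pareto front. It is a generic illustration, not MemExplorer's actual algorithm: the `DesignPoint` type, the candidate names, and every number are hypothetical placeholders chosen only to exercise the code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DesignPoint:
    """One hypothetical candidate: a memory mix plus a matrix engine size."""
    name: str
    throughput: float  # arbitrary units, higher is better
    power: float       # arbitrary units, lower is better


def dominates(a: DesignPoint, b: DesignPoint) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.throughput >= b.throughput and a.power <= b.power
    strictly_better = a.throughput > b.throughput or a.power < b.power
    return no_worse and strictly_better


def pareto_front(points: List[DesignPoint]) -> List[DesignPoint]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Placeholder numbers for illustration only; none of them come from the paper.
candidates = [
    DesignPoint("sram_heavy", throughput=10.0, power=8.0),
    DesignPoint("hbm_only", throughput=9.0, power=9.0),  # dominated by sram_heavy
    DesignPoint("lpddr_mix", throughput=6.0, power=4.0),
    DesignPoint("hbf_mix", throughput=4.0, power=3.0),
]

front = pareto_front(candidates)
front_names = [p.name for p in front]
print(front_names)
```

A real synthesizer would replace the hand-written candidates with points produced by a performance and power model of each memory configuration, and would then pick from the front according to the power budget or performance target of the prefill or decode device.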