Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference

📅 2025-09-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-context AI agent inference incurs severe off-chip memory access overhead: ultra-large contexts run into both the bandwidth and capacity "memory walls," which critically limit compute-unit utilization. To address this, the paper proposes PLENA, a software-hardware co-designed inference system. PLENA features a custom hardware architecture supporting asymmetric quantization, a flattened systolic array, and native FlashAttention acceleration units. It further delivers a full-stack solution encompassing a custom ISA, an optimizing compiler, a cycle-accurate simulator, and an automated design-space exploration framework. Experimental evaluation demonstrates that, under identical compute resources, PLENA achieves up to 8.5× higher compute-unit utilization and delivers 2.24× and 3.85× higher throughput than the NVIDIA A100 GPU and Google TPU v6e, respectively. The entire system, including the hardware RTL, compiler, and toolchain, will be fully open-sourced.

📝 Abstract
LLMs now form the backbone of AI agents for a diverse array of applications, including tool use, command-line agents, and web or computer use agents. These agentic LLM inference tasks are fundamentally different from chatbot-focused inference -- they often have much larger context lengths to capture complex, prolonged inputs, such as entire webpage DOMs or complicated tool call trajectories. This, in turn, generates significant off-chip memory traffic for the underlying hardware at the inference stage and causes the workload to be constrained by two memory walls, namely the bandwidth and capacity memory walls, preventing the on-chip compute units from achieving high utilization. In this paper, we introduce PLENA, a hardware-software co-designed system that applies three core optimization pathways to tackle these challenges. PLENA includes an efficient hardware implementation of compute and memory units supporting an asymmetric quantization scheme. PLENA also features a novel flattened systolic array architecture that has native support for FlashAttention to tackle these memory walls in the scenario of inference serving for long-context LLMs. Additionally, PLENA is developed with a complete stack, including a custom ISA, a compiler, a cycle-emulated simulator, and an automated design space exploration flow. The simulated results show that PLENA achieves up to 8.5x higher utilization than existing accelerators, and delivers 2.24x higher throughput than the A100 GPU and 3.85x higher throughput than the TPU v6e, under the same multiplier count and memory settings. The full PLENA system will also be open-sourced.
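The abstract names asymmetric quantization as one of PLENA's core optimization pathways but does not spell out the scheme. As a hedged illustration only, the textbook uniform asymmetric quantizer maps a tensor's observed float range onto unsigned integers via a scale and an integer zero-point (PLENA's actual hardware scheme may differ):

```python
def asym_quantize(xs, num_bits=8):
    """Uniform asymmetric quantization: map the observed [min, max]
    range of xs onto integers in [0, 2^b - 1] using a scale and an
    integer zero-point. Textbook scheme, not necessarily PLENA's."""
    qmax = 2 ** num_bits - 1
    x_min, x_max = min(xs), max(xs)
    scale = (x_max - x_min) / qmax if x_max > x_min else 1.0
    zero_point = round(-x_min / scale)
    # Round to the nearest integer level and clamp into range.
    q = [min(qmax, max(0, round(v / scale) + zero_point)) for v in xs]
    return q, scale, zero_point

def asym_dequantize(q, scale, zero_point):
    """Recover approximate floats from integer codes."""
    return [(v - zero_point) * scale for v in q]

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
q, s, z = asym_quantize(xs)
x_hat = asym_dequantize(q, s, z)
print(max(abs(a - b) for a, b in zip(xs, x_hat)))  # error within one step
```

Because the zero-point shifts the integer grid, an asymmetric range such as activations after a ReLU wastes no codes on values that never occur, which is the usual motivation for preferring it over symmetric quantization.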
Problem

Research questions and friction points this paper is trying to address.

Addressing memory bandwidth and capacity walls in long-context LLM inference
Optimizing hardware for large context inputs like webpage DOMs
Improving computational utilization in agentic LLM inference systems
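To make the capacity wall above concrete, a back-of-the-envelope KV-cache calculation shows how off-chip memory demand grows linearly with context length. The model shape below is a hypothetical 70B-class configuration chosen for illustration, not a figure from the paper:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2):
    """KV-cache footprint for one sequence in FP16:
    2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes/elem."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical model shape (illustrative numbers only): 80 layers,
# 8 KV heads of dimension 128, at a 128k-token agent context.
total = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                       seq_len=128_000)
print(f"{total / 2**30:.1f} GiB per sequence")  # prints "39.1 GiB per sequence"
```

At tens of GiB per sequence, the cache alone can exceed on-chip SRAM by several orders of magnitude, so every decoded token forces off-chip traffic, which is exactly the bandwidth-and-capacity pressure the paper targets.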
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hardware-software co-designed system
Asymmetric quantization scheme implementation
Flattened systolic array architecture
Haoran Wu
University of Cambridge, Cambridge, UK
Can Xiao
Imperial College London, London, UK
Jiayi Nie
University of Cambridge, Cambridge, UK
Xuan Guo
Imperial College London, London, UK
Binglei Lou
Imperial College London, London, UK
Jeffrey T. H. Wong
Imperial College London
Efficient Machine Learning, Deep Learning
Zhiwen Mo
Imperial College London
GPU Architecture, Performance Modeling, Dataflow Schedule
Cheng Zhang
Imperial College London, London, UK
Przemyslaw Forys
Imperial College London, London, UK
Wayne Luk
Professor of Computer Engineering, Imperial College London
Hardware and Architecture, Reconfigurable Computing, Design Automation
Hongxiang Fan
Imperial College London, London, UK
Jianyi Cheng
University of Edinburgh
high-level synthesis, computer architecture, formal methods, machine learning, hardware security
Timothy M. Jones
University of Cambridge
Compilers, Microarchitecture, Parallelism, Reliability
Rika Antonova
University of Cambridge, Cambridge, UK
Robert Mullins
Department of Computer Science and Technology, University of Cambridge
Computer Science - Computer Architecture - On-Chip Interconnection Networks
Aaron Zhao
Imperial College London, London, UK