Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the severe inefficiency of long-context large language model (LLM) inference, where memory management overhead accounts for 22%–97% of total execution time and exhibits highly heterogeneous computational characteristics. The study presents the first systematic modeling of the LLM memory processing pipeline, showing that optimizations such as sparse attention and retrieval-augmented generation can be unified into a single four-stage pipeline. It further introduces a GPU-FPGA heterogeneous architecture that offloads sparse, irregular, and memory-intensive operations to the FPGA while retaining compute-intensive tasks on the GPU. Evaluated on AMD MI210 and Alveo U55C platforms, the proposed approach achieves a 1.04–2.2× end-to-end speedup and reduces energy consumption by 1.11–4.7× compared to a GPU-only baseline, overcoming the limitations of conventional homogeneous acceleration.
📝 Abstract
Modern large language models (LLMs) increasingly depend on efficient long-context processing and generation mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to support complex reasoning. We show that these optimizations can be unified into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference. Through systematic profiling, we identify a 22%–97% memory processing overhead in LLM inference and strong heterogeneity in its computational characteristics. Motivated by this insight, we argue that **heterogeneous systems** are well-suited to accelerate memory processing and thus end-to-end inference. We demonstrate this approach on a GPU-FPGA system by offloading sparse, irregular, and memory-bound operations to FPGAs while retaining compute-intensive operations on GPUs. Evaluated on an AMD MI210 GPU and an Alveo U55C FPGA, our system is 1.04–2.2× faster and consumes 1.11–4.7× less energy across multiple LLM inference optimizations than the GPU baseline (similar results hold on an NVIDIA A100). These results establish heterogeneous systems as a practical direction for efficient LLM memory processing and inform future heterogeneous hardware designs.
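The four-step pipeline named in the abstract (Prepare Memory → Compute Relevancy → Retrieval → Apply to Inference) can be illustrated with a minimal sketch. All function names, the chunking scheme, and the toy overlap score below are hypothetical illustrations, not the paper's actual implementation:

```python
# Hypothetical sketch of the four-stage memory processing pipeline.
# Real systems would score chunks with attention or embedding similarity;
# a toy token-overlap score stands in here.

def prepare_memory(context_tokens, chunk_size=4):
    """Stage 1: Prepare Memory — split long context into fixed-size chunks."""
    return [context_tokens[i:i + chunk_size]
            for i in range(0, len(context_tokens), chunk_size)]

def compute_relevancy(chunks, query_token):
    """Stage 2: Compute Relevancy — score each chunk against the query."""
    return [sum(1 for t in chunk if t == query_token) for chunk in chunks]

def retrieve(chunks, scores, top_k=2):
    """Stage 3: Retrieval — keep only the top-k highest-scoring chunks."""
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in ranked[:top_k]]

def apply_to_inference(selected_chunks, query_token):
    """Stage 4: Apply to Inference — hand the reduced memory to the model
    (stubbed here as a flat token list the model would attend over)."""
    return {"attended_tokens": [t for c in selected_chunks for t in c],
            "query": query_token}

context = ["a", "b", "q", "c", "q", "q", "d", "e"]
chunks = prepare_memory(context)         # [['a','b','q','c'], ['q','q','d','e']]
scores = compute_relevancy(chunks, "q")  # [1, 2]
selected = retrieve(chunks, scores, top_k=1)
result = apply_to_inference(selected, "q")
```

Stages 1–3 are exactly the sparse, irregular, memory-bound work (chunking, scoring, top-k selection) that the paper offloads to the FPGA, while stage 4's dense model compute stays on the GPU.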
Problem

Research questions and friction points this paper is trying to address.

memory processing overhead
heterogeneous systems
LLM inference
long-context processing
computational heterogeneity
Innovation

Methods, ideas, or system contributions that make the work stand out.

heterogeneous systems
memory processing pipeline
LLM inference
GPU-FPGA acceleration
retrieval-augmented generation