🤖 AI Summary
This work addresses the severe inefficiency in long-context large language model (LLM) inference, where memory management overhead accounts for 22%–97% of total execution time and exhibits highly heterogeneous computational characteristics. The study presents the first systematic modeling of the LLM memory processing pipeline, unifying optimizations such as sparse attention and retrieval-augmented generation into a single four-stage pipeline. It further introduces a GPU-FPGA heterogeneous architecture that offloads sparse, irregular, and memory-intensive operations to the FPGA while retaining compute-intensive tasks on the GPU. Evaluated on AMD MI210 and Alveo U55C platforms, the proposed approach achieves a 1.04–2.2× end-to-end speedup and reduces energy consumption by 1.11–4.7× compared to GPU-only baselines, thereby overcoming the limitations of conventional homogeneous acceleration.
📝 Abstract
Modern large language models (LLMs) increasingly depend on efficient long-context processing and generation mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to support complex reasoning. We show that these optimizations can be unified into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference. Through systematic profiling, we identify a 22%–97% memory processing overhead in LLM inference and strong heterogeneity in its computational characteristics. Motivated by this insight, we argue that **heterogeneous systems** are well-suited to accelerate memory processing and thus end-to-end inference. We demonstrate this approach on a GPU-FPGA system by offloading sparse, irregular, and memory-bound operations to FPGAs while retaining compute-intensive operations on GPUs. Evaluated on an AMD MI210 GPU and an Alveo U55C FPGA, our system is 1.04–2.2× faster and requires 1.11–4.7× less energy across multiple LLM inference optimizations than the GPU baseline (similar results hold on an NVIDIA A100). These results establish heterogeneous systems as a practical direction for efficient LLM memory processing and inform future heterogeneous hardware design.
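As a rough illustration of the four-step pipeline named in the abstract, the sketch below walks a toy query through Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference. All function names, the block layout, and the overlap-based scoring are assumptions made here for exposition; they are not the paper's implementation or kernels.

```python
# Hypothetical sketch of the four-stage memory processing pipeline.
# Stage internals are illustrative placeholders only.

def prepare_memory(context_tokens, block_size=4):
    # Stage 1: Prepare Memory - organize long context into retrievable
    # blocks (e.g., KV-cache pages, RAG chunks, or compressed summaries).
    return [context_tokens[i:i + block_size]
            for i in range(0, len(context_tokens), block_size)]

def compute_relevancy(query, blocks):
    # Stage 2: Compute Relevancy - score each block against the query.
    # A toy token-overlap count stands in for similarity scoring.
    return [len(set(query) & set(block)) for block in blocks]

def retrieve(blocks, scores, top_k=2):
    # Stage 3: Retrieval - gather the top-k most relevant blocks; this
    # sparse, irregular gather is the kind of memory-bound work the
    # paper offloads to the FPGA.
    order = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    return [blocks[i] for i in order[:top_k]]

def apply_to_inference(query, selected_blocks):
    # Stage 4: Apply to Inference - feed the reduced context to the
    # compute-intensive model kernels kept on the GPU (placeholder here).
    reduced_context = [tok for block in selected_blocks for tok in block]
    return query + reduced_context

tokens = ["a", "b", "c", "d", "e", "f", "g", "h"]
blocks = prepare_memory(tokens)
scores = compute_relevancy(["c", "g"], blocks)
selected = retrieve(blocks, scores)
model_input = apply_to_inference(["c", "g"], selected)
```

The point of the sketch is the dataflow, not the kernels: stages 1–3 are dominated by irregular memory movement, while stage 4 is the dense compute the GPU already handles well, which is the heterogeneity the abstract's GPU-FPGA split exploits.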