SkipOPU: An FPGA-based Overlay Processor for Large Language Models with Dynamically Allocated Computation

📅 2026-03-15
🤖 AI Summary
This work addresses the computational inefficiency of large language model (LLM) inference caused by rapidly growing parameter counts and the inability of existing hardware to support dynamic computation allocation. The authors propose SkipOPU, an FPGA-based overlay processor that dynamically schedules computation across both tokens and layers through a lightweight routing mechanism, while fusing nonlinear and linear operations to hide latency. Key innovations include incremental reduction, operator fusion, a mixed-precision floating-point/fixed-point processing element array, DSP overpacking, and proactive on-chip key-value (KV) cache reuse. Implemented on an AMD Alveo U280 FPGA, SkipOPU achieves 1.23×–3.83× higher bandwidth efficiency than GPU and other FPGA-based accelerators and reduces KV storage overhead by up to 25.4%.

📝 Abstract
Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their inference efficiency remains a critical bottleneck due to rapidly growing parameter counts. Recent advances in dynamic computation allocation address this challenge by exploiting the highly uneven contributions of different tokens and layers, enabling selective execution that significantly reduces redundant computation while preserving model accuracy. However, existing hardware platforms and accelerators are primarily optimized for uniform, static execution, limiting their ability to efficiently support such dynamic inference patterns. In this work, we propose SkipOPU, an FPGA-based overlay processor that dynamically allocates computation across tokens and layers with high flexibility through a lightweight routing mechanism. First, we decouple reduction operations from element-wise computation in nonlinear modules and perform the reductions incrementally, which enables both stages to be fused with adjacent linear operations (the router or matrix multiplication) for effective latency hiding. Second, motivated by the asymmetric sensitivity of activations and weights to numerical precision, we design a PE array that efficiently supports hybrid float-fixed execution. A novel DSP overpacking technique is introduced to maximize hardware utilization while minimizing resource overhead. Finally, we develop a proactive on-chip KV history buffer that exploits the cross-layer KV invariance of pruned tokens, eliminating irregular HBM accesses during decoding and supplementing off-chip bandwidth through high-locality on-chip reuse. Experimental results demonstrate that SkipOPU on an AMD U280 FPGA outperforms GPU and other FPGA-based accelerators by 1.23×–3.83× in bandwidth efficiency for LLM inference with dynamic computation allocation and reduces KV storage overhead by up to 25.4% across varying sequence lengths.
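The incremental-reduction idea in the abstract (splitting a nonlinear module's reduction out of its element-wise stage so each piece can overlap with an adjacent linear operation) is in the spirit of online normalization. Below is a minimal sketch in the style of online softmax, where the running max and normalizer are updated one element at a time as scores stream in; this is an illustrative analogue, not the paper's exact formulation.

```python
import math

def online_softmax(stream):
    """Incrementally maintain the running max and normalizer of a softmax
    over a stream of scores, so the reduction can proceed while upstream
    computation (e.g., a matmul) is still producing values.
    Illustrative sketch, not the paper's scheme."""
    m = float("-inf")  # running max
    s = 0.0            # running sum of exp(x - m)
    xs = []
    for x in stream:
        xs.append(x)
        m_new = max(m, x)
        # rescale the old partial sum to the new max, then add the new term
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # element-wise stage: can itself be fused with the next linear op
    return [math.exp(x - m) / s for x in xs]
```

The key property is that no second pass over the inputs is needed for the max, which is what makes fusion with the producing matmul possible.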
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Dynamic Computation Allocation
Hardware Acceleration
FPGA
Inference Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic computation allocation
FPGA overlay processor
hybrid precision execution
incremental reduction
on-chip KV caching
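The dynamic computation allocation listed above hinges on a lightweight router that decides, per layer, which tokens are computed and which are skipped. A hypothetical Python sketch of such a per-layer token router follows; the linear scoring and the threshold policy are assumptions for illustration, not the paper's design.

```python
def route_tokens(hidden, router_weights, threshold=0.0):
    """Per-layer token routing sketch: a lightweight linear router scores
    each token's hidden state; tokens scoring below `threshold` skip the
    layer (identity pass-through). Names and policy are illustrative."""
    keep, skip = [], []
    for i, h in enumerate(hidden):
        score = sum(w * x for w, x in zip(router_weights, h))
        (keep if score >= threshold else skip).append(i)
    return keep, skip
```

Because the router is a small linear operation, it is exactly the kind of adjacent linear stage the abstract describes fusing with the incremental reductions of the surrounding nonlinear modules.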
Zicheng He
University of California, Los Angeles, USA
Anhao Zhao
Institute of Digital Twin, Eastern Institute of Technology, China
Xiaoyu Shen
Eastern Institute of Technology, Ningbo, China
language model · multi-modal learning · reasoning
Chen Wu
Institute of Computing Technology, Chinese Academy of Sciences
Information Retrieval · Natural Language Processing · Adversarial Attack
He Lei
Eastern Institute of Technology, China