A Persistent-State Dataflow Accelerator for Memory-Bound Linear Attention Decode on FPGA

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the memory bandwidth bottleneck of linear attention mechanisms—such as Gated DeltaNet—during batch-1 decoding on GPUs, where frequent access to recurrent states in high-bandwidth memory (HBM) limits performance. We present the first FPGA-based decoding architecture that fully caches the 2 MB recurrent state on-chip in block RAM (BRAM), enabling persistent residency throughout inference. By fusing the Gated DeltaNet recurrence into a five-stage pipeline, our design reads and writes each state matrix only once per token. Furthermore, we exploit grouped-value attention and inter-head parallelism (2–16 heads) to overlap data movement and computation. Implemented on an AMD Alveo U55C platform using Vitis HLS, the accelerator shifts the workload from memory-bound to compute-bound, achieving a decoding latency of 63 µs/token—4.5× faster than an NVIDIA H100 PCIe GPU—and up to 60× higher energy efficiency at an on-chip power consumption of 9.96 W.
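The "one read and one write pass over each state matrix per token" claim can be illustrated with a delta-rule update. The sketch below is not the paper's exact kernel; it is one common formulation of a gated delta-rule step for DeltaNet-style linear attention, with scalar gates `alpha` (decay) and `beta` (write strength) assumed for simplicity:

```python
import numpy as np

def gdn_decode_step(S, q, k, v, alpha, beta):
    """One illustrative decode step (assumed formulation, not the paper's kernel).
    S: (d_v, d_k) recurrent state; q, k: (d_k,); v: (d_v,);
    alpha: scalar decay gate in (0, 1]; beta: scalar write strength."""
    S = alpha * S                    # decay gate touches every state element
    err = S @ k - v                  # delta-rule prediction error for key k
    S = S - beta * np.outer(err, k)  # rank-1 correction toward storing (k, v)
    o = S @ q                        # read out with the query
    return S, o                      # the full state is read and written once

# Tiny demo: with alpha = 1, beta = 1 and a unit-norm key, the update
# stores v exactly, i.e. S @ k equals v afterwards.
rng = np.random.default_rng(0)
d_k, d_v = 8, 8
S = rng.standard_normal((d_v, d_k))
k = rng.standard_normal(d_k); k /= np.linalg.norm(k)
v = rng.standard_normal(d_v)
q = rng.standard_normal(d_k)
S, o = gdn_decode_step(S, q, k, v, alpha=1.0, beta=1.0)
print(np.allclose(S @ k, v))  # True
```

Because every element of the 2 MB state participates in the decay and rank-1 correction, a GPU must stream the whole state through HBM each token, which is exactly the traffic the on-chip BRAM residency eliminates.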

📝 Abstract
Gated DeltaNet (GDN) is a linear attention mechanism that replaces the growing KV cache with a fixed-size recurrent state. Hybrid LLMs like Qwen3-Next use 75% GDN layers and achieve accuracy competitive with attention-only models. However, at batch 1, GDN decode is memory-bound on GPUs, since the full recurrent state must be round-tripped through HBM every token. We show that this bottleneck is architectural, not algorithmic, as all subquadratic sequence models exhibit arithmetic intensities below 1 FLOP/B at decode time, making them more memory-bound than standard Transformers. We present an FPGA accelerator that eliminates this bottleneck by holding the full 2 MB recurrent state persistently in on-chip BRAM, converting the workload from memory-bound to compute-bound. Our design fuses the GDN recurrence into a five-phase pipelined datapath that performs only one read and one write pass over each state matrix per token, exploits Grouped Value Attention for paired-head parallelism, and overlaps preparation, computation, and output storage via dataflow pipelining. We explore four design points on an AMD Alveo U55C using Vitis HLS, varying head-level parallelism from 2 to 16 value-heads per iteration. Our fastest configuration achieves 63 µs per token, 4.5× faster than the GPU reference on NVIDIA H100 PCIe. Post-implementation power analysis reports 9.96 W on-chip, yielding up to 60× greater energy efficiency per token decoded.
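The "below 1 FLOP/B" claim follows from a back-of-the-envelope estimate. The numbers below are illustrative assumptions (fp16 state elements, one read plus one write of the state per token, roughly one multiply and one add per element for the gated rank-1 update), applied to the paper's 2 MB state:

```python
# Arithmetic-intensity estimate for batch-1 linear-attention decode.
# All per-element costs are assumptions for illustration, not paper figures.
state_bytes = 2 * 1024 * 1024     # 2 MB recurrent state (from the paper)
elems = state_bytes // 2          # assume fp16: 2 bytes per element
flops_per_token = 2 * elems       # assume ~1 multiply + 1 add per element
bytes_per_token = 2 * state_bytes # state read once and written once per token
intensity = flops_per_token / bytes_per_token
print(intensity)  # 0.5 FLOP/byte, well below 1
```

At 0.5 FLOP/B the workload sits far left of any modern GPU's roofline knee, so decode throughput is set by HBM bandwidth rather than compute; keeping the state in BRAM removes that term entirely.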
Problem

Research questions and friction points this paper is trying to address.

memory-bound
linear attention
recurrent state
decode
FPGA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Persistent-State Accelerator
Linear Attention
Memory-Bound Optimization
FPGA Dataflow Architecture
Gated DeltaNet