Online Vector Quantized Attention

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the longstanding trade-off between computational efficiency and long-context modeling capability in conventional sequence mixing layers. The authors propose OVQ-Attention, a sparse attention mechanism grounded in online vector quantization and Gaussian mixture regression, which significantly increases memory capacity and long-context performance while maintaining linear computational complexity and a constant memory footprint. By circumventing the performance limitations of existing linear attention and state space models, OVQ-Attention achieves accuracy comparable to standard self-attention on both synthetic tasks and 64k-length language modeling benchmarks, with substantially reduced memory overhead.

📝 Abstract
Standard sequence mixing layers used in language models struggle to balance efficiency and performance. Self-attention performs well on long-context tasks but has expensive quadratic compute and linear memory costs, while linear attention and SSMs use only linear compute and constant memory but struggle with long-context processing. In this paper, we develop a sequence mixing layer that aims to find a better compromise between memory-compute costs and long-context processing, which we call online vector-quantized (OVQ) attention. OVQ-attention requires linear compute costs and constant memory, but, unlike linear attention and SSMs, it uses a sparse memory update that allows it to greatly increase the size of its memory state and, consequently, its memory capacity. We develop a theoretical basis for OVQ-attention based on Gaussian mixture regression, and we test it on a variety of synthetic long-context tasks and on long-context language modeling. OVQ-attention shows significant improvements over linear attention baselines and over the original VQ-attention, which inspired OVQ-attention. It demonstrates competitive, and sometimes identical, performance to strong self-attention baselines up to 64k sequence length, despite using a small fraction of the memory of full self-attention.
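The abstract's core idea, a constant-size memory state written to sparsely via vector quantization, can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's actual method: the codebook is assumed fixed (the paper learns it online), and the Gaussian mixture regression readout is approximated here by a plain softmax over centroid-query similarities.

```python
import numpy as np

def ovq_attention_step(state, counts, codebook, k, v, q):
    """One token step of a VQ-style sparse memory update (illustrative sketch,
    not the paper's OVQ-attention: codebook is fixed, readout is a softmax)."""
    # Sparse write: quantize the key to its nearest centroid, so only one
    # of the M memory slots is touched per token (vs. all slots in linear attention).
    c = np.argmin(np.linalg.norm(codebook - k, axis=1))
    state[c] += v           # accumulate the value in the selected slot
    counts[c] += 1          # track slot occupancy for averaging

    # Read: soft-assign the query over centroids and mix the stored slot means.
    sims = codebook @ q
    w = np.exp(sims - sims.max())
    w /= w.sum()
    means = state / np.maximum(counts[:, None], 1)
    return w @ means        # output vector of dimension d
```

Because the state has a fixed number of slots M regardless of sequence length, memory stays constant and each step costs O(M·d), giving linear compute over the sequence.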
Problem

Research questions and friction points this paper is trying to address.

sequence mixing
long-context processing
memory-compute trade-off
attention mechanism
language modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

online vector-quantized attention
linear complexity
constant memory
long-context modeling
sparse memory update
Nick Alonso
Zyphra
Tomas Figliolia
Zyphra
Beren Millidge
Postdoctoral Researcher, University of Oxford