MiTA Attention: Efficient Fast-Weight Scaling via a Mixture of Top-k Activations

πŸ“… 2026-02-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the high computational cost Transformers incur on long sequences, which stems from attention's fast-weight memory growing linearly with sequence length. The authors reformulate attention as a dynamically instantiated two-layer fast-weight MLP and introduce the MiTA mechanism: it compresses the original wide MLP through a small set of landmark queries and gathers the top-k activated key-value pairs for each landmark query to construct deformable experts, thereby unifying model compression with sparse routing. The authors present this as the first effort to embed efficient attention within a fast-weight scaling framework, significantly reducing computational complexity while preserving expressive capacity. Preliminary vision experiments demonstrate MiTA's effectiveness and its potential in long-context scenarios.
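The fast-weight view of attention can be made concrete in a few lines of NumPy: for a single query q, softmax attention over N keys K and values V is exactly a two-layer MLP of width N whose first-layer weights are K, whose activation is softmax, and whose second-layer weights are Vᵀ. The function names below are illustrative, not from the paper.

```python
import numpy as np

def attention(Q, K, V):
    """Standard softmax attention: (T, d) queries over (N, d) keys/values."""
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

def fast_weight_mlp(q, K, V):
    """The same computation read as a two-layer MLP of width N:
    first-layer weights K, softmax activation, second-layer weights V.T."""
    h = K @ q / np.sqrt(q.shape[0])        # first layer: one unit per kv pair
    h = np.exp(h - h.max())                # softmax activation over the
    h /= h.sum()                           # N-wide hidden layer
    return V.T @ h                         # second layer: mix the values
```

Since the hidden width equals the sequence length N, the fast weights (K, V) grow with context, which is exactly the scaling bottleneck the paper targets.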

πŸ“ Abstract
The attention operator in Transformers can be viewed as a two-layer fast-weight MLP, whose weights are dynamically instantiated from input tokens and whose width equals sequence length N. As the context extends, the expressive capacity of such an N-width MLP increases, but scaling its fast weights becomes prohibitively expensive for extremely long sequences. Recently, this fast-weight scaling perspective has motivated the Mixture-of-Experts (MoE) attention, which partitions the sequence into fast-weight experts and sparsely routes the tokens to them. In this paper, we elevate this perspective to a unifying framework for a wide range of efficient attention methods by interpreting them as scaling fast weights through routing and/or compression. Then we propose a compress-and-route strategy, which compresses the N-width MLP into a narrower one using a small set of landmark queries and constructs deformable experts by gathering top-k activated key-value pairs for each landmark query. We call this strategy a Mixture of Top-k Activations (MiTA), and refer to the resulting efficient mechanism as MiTA attention. Preliminary experiments on vision tasks demonstrate the promise of our MiTA attention and motivate further investigation on its optimization and broader applications in more challenging settings.
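The abstract does not spell out MiTA's exact routing rule, so the NumPy sketch below only illustrates the compress-and-route idea under stated assumptions: m landmark queries score all N keys, each landmark keeps its top-k activated key-value pairs as a deformable expert, and each query token is (hypothetically) routed to its most similar landmark and attends only within that expert. All names and the argmax routing are assumptions for illustration.

```python
import numpy as np

def mita_attention_sketch(Q, K, V, landmarks, k):
    """Hypothetical compress-and-route sketch (not the paper's exact method).

    Q: (T, d) queries; K, V: (N, d) keys/values;
    landmarks: (m, d) landmark queries; k: kv pairs kept per expert.
    """
    d = K.shape[1]
    # Compress: landmark-key activations of the N-width MLP (m << N).
    act = landmarks @ K.T / np.sqrt(d)              # (m, N)
    # Deformable experts: top-k activated kv indices per landmark.
    topk = np.argsort(-act, axis=1)[:, :k]          # (m, k)
    # Route: send each query to its most similar landmark (an assumption).
    route = np.argmax(Q @ landmarks.T, axis=1)      # (T,)
    out = np.empty_like(Q)
    for t, e in enumerate(route):
        idx = topk[e]                               # this expert's kv pairs
        s = K[idx] @ Q[t] / np.sqrt(d)              # attend within the expert
        w = np.exp(s - s.max())
        w /= w.sum()
        out[t] = w @ V[idx]
    return out
```

Each token thus touches only k of the N key-value pairs, replacing the N-wide fast-weight MLP with a narrow, per-expert one.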
Problem

Research questions and friction points this paper is trying to address.

attention
fast-weight
long sequence
efficient attention
scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

MiTA Attention
fast-weight scaling
Mixture of Top-k Activations
efficient attention
landmark queries
Qishuai Wen
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, P.R. China
Zhiyuan Huang
University of Science and Technology of China
Simulation · Data-Driven Optimization
Xianghan Meng
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, P.R. China
Wei He
Beijing University of Posts and Telecommunications
Chun-Guang Li
Associate Professor, Beijing University of Posts and Telecommunications
Subspace Clustering · Self-Supervised Learning · Time Series Modeling · Biomedical Engineering