🤖 AI Summary
This work addresses the high computational cost of Transformers on long sequences, which stems from the fast-weight memory of the attention mechanism growing linearly with sequence length. The authors reformulate attention as a dynamically instantiated two-layer fast-weight MLP and introduce the MiTA mechanism: it compresses the original wide MLP with a small set of landmark queries and gathers the top-k activated key-value pairs for each landmark to construct deformable experts, thereby unifying model compression with sparse routing. This approach represents the first effort to embed efficient attention within a fast-weight scaling framework, substantially reducing computational cost while preserving expressive capacity. Preliminary vision experiments demonstrate MiTA's effectiveness and its potential in long-context scenarios.
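The "attention as a two-layer fast-weight MLP" view underlying this summary can be checked numerically: for each query, the keys act as the first-layer weights (one hidden unit per token, so width N), softmax is the activation, and the values act as the second-layer weights. A minimal sketch (standard scaled dot-product attention; all array names are illustrative, not from the paper's code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, d = 8, 4                      # sequence length, head dimension
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))  # first-layer fast weights (width N)
V = rng.standard_normal((N, d))  # second-layer fast weights

# Standard attention over the whole sequence.
attn = softmax(Q @ K.T / np.sqrt(d)) @ V

# The same computation, read as a two-layer MLP applied to each query q:
#   hidden = softmax(K q / sqrt(d))   -- N hidden units, one per token
#   output = V^T hidden
mlp_out = np.stack([V.T @ softmax(K @ q / np.sqrt(d)) for q in Q])

assert np.allclose(attn, mlp_out)
```

Because the hidden width equals N, the fast-weight matrices K and V grow with the context, which is exactly the scaling cost the paper targets.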
📝 Abstract
The attention operator in Transformers can be viewed as a two-layer fast-weight MLP, whose weights are dynamically instantiated from input tokens and whose width equals the sequence length N. As the context grows, the expressive capacity of such an N-width MLP increases, but scaling its fast weights becomes prohibitively expensive for extremely long sequences. Recently, this fast-weight scaling perspective has motivated Mixture-of-Experts (MoE) attention, which partitions the sequence into fast-weight experts and sparsely routes tokens to them. In this paper, we elevate this perspective to a unifying framework for a wide range of efficient attention methods by interpreting them as scaling fast weights through routing and/or compression. We then propose a compress-and-route strategy, which compresses the N-width MLP into a narrower one using a small set of landmark queries and constructs deformable experts by gathering the top-k activated key-value pairs for each landmark query. We call this strategy a Mixture of Top-k Activations (MiTA), and refer to the resulting efficient mechanism as MiTA attention. Preliminary experiments on vision tasks demonstrate the promise of our MiTA attention and motivate further investigation into its optimization and broader applications in more challenging settings.
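The compress-and-route step described above can be sketched in a few lines. This is a speculative toy illustration under stated assumptions, not the paper's implementation: the landmark queries `L`, the affinity score, and the expert construction are all hypothetical stand-ins for whatever learned components MiTA actually uses. Each landmark gathers its top-k activated key-value pairs, yielding a narrow k-width fast-weight MLP ("deformable expert") in place of the full N-width one:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, d, m, k = 64, 16, 4, 8   # tokens, dim, landmark queries, top-k per landmark
K = rng.standard_normal((N, d))          # keys: first-layer fast weights
V = rng.standard_normal((N, d))          # values: second-layer fast weights
L = rng.standard_normal((m, d))          # hypothetical learned landmark queries

# Affinity of each landmark query to every key (m x N).
scores = L @ K.T / np.sqrt(d)

# For each landmark, gather the indices of its k most strongly
# activated key-value pairs.
topk = np.argpartition(-scores, k, axis=1)[:, :k]

# Each "deformable expert" is a narrow k-width fast-weight MLP built
# from the gathered key-value pairs.
experts = [(K[idx], V[idx]) for idx in topk]

def expert_attention(q, Ke, Ve):
    """Attention of one query against a single k-width expert."""
    return Ve.T @ softmax(Ke @ q / np.sqrt(d))

# A token routed to an expert attends over only k pairs instead of N.
q = rng.standard_normal(d)
outs = np.stack([expert_attention(q, Ke, Ve) for Ke, Ve in experts])
```

Per-token cost drops from O(N) to O(k) key-value pairs per expert visited; how tokens are routed among the m experts, and how the landmarks are trained, are the optimization questions the abstract flags for further study.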