Why Attend to Everything? Focus is the Key

📅 2026-03-12

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

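Standard attention is computationally expensive, and efficient alternatives tend to hurt model quality. To address this, the paper proposes Focus, which uses learnable centroids to dynamically select the token pairs that matter. With the pretrained model weights kept frozen, it achieves efficient attention with no loss in quality, and in some cases better quality.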

📝 Abstract

We introduce Focus, a method that learns which token pairs matter rather than approximating all of them. Learnable centroids assign tokens to groups; distant attention is restricted to same-group pairs while local attention operates at full resolution. Because all model weights stay frozen, Focus is purely additive: centroid-only training (as few as 148K parameters) improves domain perplexity with zero degradation on downstream benchmarks, for models from 124M to 70B parameters and across five attention architectures. No existing efficient attention method achieves this in the retrofit setting. At 124M parameters, Focus surpasses full attention (30.3 vs 31.4 PPL); trained from scratch at 7B scale (2B tokens), Focus again beats full attention (13.82 vs 13.89 PPL). At inference, restricting each token to its top-k highest-scoring groups discretizes the soft routing into a hard sparsity pattern, yielding a 2x speedup while beating the pretrained baseline (41.3 vs 42.8 PPL); decomposing this pattern into two standard FlashAttention calls reaches an 8.6x wall-clock speedup at 1M tokens with no custom kernels. Unlike LoRA, centroid routing preserves alignment: instruction-tuned models retain their TruthfulQA scores after adaptation, while LoRA degrades them at every learning rate and rank. Sinkhorn normalization enforces balanced groups as a hard constraint, and the resulting groups discover interpretable linguistic categories without supervision.
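
The routing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation; it is a toy reconstruction under stated assumptions. The dot-product scoring, the causal masking, the "two tokens share at least one selected group" rule, and every name here (`route_scores`, `sinkhorn`, `top_k_groups`, `focus_mask`, `window`) are hypothetical choices made for illustration, and the centroids are random rather than learned.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def route_scores(tokens, centroids):
    """Soft assignment of each token to each group (assumed: dot-product similarity)."""
    return [
        softmax([sum(t * c for t, c in zip(tok, cen)) for cen in centroids])
        for tok in tokens
    ]


def sinkhorn(scores, n_iters=50):
    """Alternate column and row normalization so groups receive roughly equal mass."""
    n, g = len(scores), len(scores[0])
    p = [row[:] for row in scores]
    for _ in range(n_iters):
        # Column step: rescale so each group holds total mass n / g.
        col = [sum(p[i][j] for i in range(n)) for j in range(g)]
        p = [[p[i][j] * (n / g) / col[j] for j in range(g)] for i in range(n)]
        # Row step: rescale so each token's assignment sums to 1.
        row_sums = [sum(row) for row in p]
        p = [[x / s for x in row] for row, s in zip(p, row_sums)]
    return p


def top_k_groups(scores, k):
    """Discretize soft routing: keep each token's k highest-scoring groups."""
    return [
        set(sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k])
        for row in scores
    ]


def focus_mask(groups, window):
    """Boolean mask: query i may attend to key j <= i (causality is an assumption)
    if j lies in the local window or the two tokens share a selected group."""
    n = len(groups)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            is_local = i - j < window
            same_group = bool(groups[i] & groups[j])
            mask[i][j] = is_local or same_group
    return mask


random.seed(0)
n_tokens, dim, n_groups = 16, 8, 4
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
centroids = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_groups)]

scores = sinkhorn(route_scores(tokens, centroids))
groups = top_k_groups(scores, k=1)
mask = focus_mask(groups, window=4)

kept = sum(sum(row) for row in mask)
causal_pairs = n_tokens * (n_tokens + 1) // 2
print(f"kept {kept} of {causal_pairs} causal token pairs")
```

In this toy version, only the centroid vectors would be trainable, which is consistent with the abstract's description of a small additive parameter count on top of frozen weights. A practical implementation would operate on batched tensors and would not materialize a dense mask at long sequence lengths.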

Problem

Research questions and friction points this paper is trying to address.

attention mechanism

computational complexity

efficient attention

pretrained models

sequence length

Innovation

Methods, ideas, or system contributions that make the work stand out.

efficient attention

learnable centroids

sparse attention