Degrees of Freedom for Linear Attention: Distilling Softmax Attention with Optimal Feature Efficiency

📅 2025-07-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the imbalance between approximation accuracy and computational efficiency in linear attention mechanisms caused by manually specified feature dimensions. The authors propose a layer-adaptive feature dimension selection method grounded in statistical degrees of freedom (DoF), which quantify the effective input dimensionality of each layer and allow computational resources to be allocated dynamically. Under a fixed budget, the method comes with a theoretical guarantee of tighter approximation error relative to softmax attention. The paper also uncovers a depth-wise pattern in how attention complexity evolves across Transformer layers. Combined with nonlinear feature maps and a layerwise distillation strategy, the approach enables efficient softmax attention distillation on multiple pretrained Transformers, achieving notable performance gains and fidelity close to the teacher model without increasing inference cost. The core innovation is introducing statistical DoF into linear attention architecture design, enabling theoretically justified, layer-aware, and computation-aware dimension adaptation.
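To make the setting concrete, here is a minimal sketch of the approximation being distilled: linear attention replaces the softmax kernel `exp(q·k)` with an inner product of feature maps `phi(q)·phi(k)`, which lets the key-value product be computed once and reduces the cost from quadratic to linear in sequence length. The feature map `phi` below (ELU+1, a common choice in the linear-attention literature) is illustrative only; the paper learns nonlinear features per layer, and the feature dimension `r` of `phi`'s output is exactly the quantity its DoF method selects.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard softmax attention: O(n^2 d) in sequence length n.
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi):
    # Linear attention: exp(q.k) is approximated by phi(q).phi(k), so
    # phi(K)^T V can be computed once -- O(n r d) for feature dimension r.
    Qf, Kf = phi(Q), phi(K)                # (n, r) feature maps
    KV = Kf.T @ V                          # (r, d), shared across queries
    norm = Qf @ Kf.sum(axis=0)             # (n,) normalizer, positive since phi > 0
    return (Qf @ KV) / norm[:, None]

# Illustrative positive feature map (ELU + 1); the paper instead learns
# nonlinear features tailored to each layer.
phi = lambda X: np.where(X > 0, X + 1.0, np.exp(X))

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
out_soft = softmax_attention(Q, K, V)
out_lin = linear_attention(Q, K, V, phi)
print(out_soft.shape, out_lin.shape)  # both (8, 4)
```

Here `phi` acts elementwise, so `r` equals the head dimension; the paper's point is that `r` should instead be chosen per layer, with larger `r` giving a better approximation at higher cost.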

📝 Abstract
Linear attention has attracted interest as a computationally efficient approximation to softmax attention, especially for long sequences. Recent studies have explored distilling softmax attention in pre-trained Transformers into linear attention. However, a critical challenge remains: how to choose the feature dimension that governs the approximation quality. Existing methods fix this dimension uniformly across all attention layers, overlooking their diverse roles and complexities. In this paper, we propose a principled method to automatically determine the feature dimension in linear attention using the concept of statistical degrees of freedom, which represent the effective dimensionality of the inputs. We provide a theoretical bound on the approximation error and show that the dimension chosen by our method achieves smaller error under a fixed computational budget. Furthermore, we introduce an efficient layerwise training strategy to learn nonlinear features tailored to each layer. Experiments on multiple pre-trained Transformers demonstrate that our method improves the performance of distilled models compared to baselines without increasing the inference cost. Our findings also provide insight into how the complexity of the attention mechanism evolves across layers.
Problem

Research questions and friction points this paper is trying to address.

Choosing optimal feature dimension for linear attention approximation
Automating feature dimension selection using statistical degrees of freedom
Improving distilled linear attention performance without increasing computational cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic feature dimension selection via statistical degrees of freedom
Theoretical approximation-error bound guaranteeing smaller error under a fixed computational budget
Layerwise training strategy that learns nonlinear features tailored to each layer
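The "statistical degrees of freedom" behind the selection method has a standard formulation from kernel ridge regression, sketched below under the assumption that the paper uses a quantity of this form (its exact definition may differ): df(λ) = Σᵢ μᵢ / (μᵢ + λ), where μᵢ are the eigenvalues of the Gram matrix of a layer's inputs. It counts how many directions carry variance above the regularization scale λ, giving an effective dimensionality that can differ per layer.

```python
import numpy as np

def degrees_of_freedom(X, lam=1e-2):
    # Statistical degrees of freedom of the (linear-kernel) Gram matrix:
    # df(lam) = sum_i mu_i / (mu_i + lam), mu_i the eigenvalues of K = X X^T.
    # Eigen-directions with mu_i >> lam contribute ~1, those with
    # mu_i << lam contribute ~0, so df counts the effective dimensions.
    K = X @ X.T
    mu = np.linalg.eigvalsh(K)
    return float(np.sum(mu / (mu + lam)))

rng = np.random.default_rng(0)
# Inputs that lie on a 3-dimensional subspace of R^16:
X = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 16))
df = degrees_of_freedom(X, lam=1e-2)
print(round(df))  # close to 3 for small lam
```

In the paper's setting, a quantity like `df` computed from each layer's inputs would drive the per-layer feature dimension, allocating more features to layers whose attention inputs are effectively higher-dimensional.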