Attention-Only Transformers via Unrolled Subspace Denoising

📅 2025-06-04
📈 Citations: 1
Influential: 0
🤖 AI Summary
Transformer architectures are designed empirically, offer limited interpretability, and contain structural redundancy—particularly in components such as feed-forward networks (FFNs) and LayerNorm. Method: Grounded in subspace denoising theory, this work derives a purely attention-based, interpretable transformer: representation learning is formalized as iterative compression and denoising of noisy tokens onto a mixture of low-dimensional subspaces, naturally yielding a minimal architecture comprising only self-attention and skip connections. A multi-head subspace self-attention mechanism is introduced, combining mathematical interpretability with structural minimality. Contribution/Results: The authors prove that each layer improves the signal-to-noise ratio of token representations at a linear rate in the number of layers. Empirically, the model matches GPT-2 and CRATE performance on vision and language benchmarks, suggesting that FFNs and LayerNorm are dispensable. This establishes a principled path toward lightweight, efficient large-model design.

📝 Abstract
Despite the popularity of transformers in practice, their architectures are empirically designed and neither mathematically justified nor interpretable. Moreover, as indicated by many empirical studies, some components of transformer architectures may be redundant. To derive a fully interpretable transformer architecture with only necessary components, we contend that the goal of representation learning is to compress a set of noisy initial token representations towards a mixture of low-dimensional subspaces. To compress these noisy token representations, an associated denoising operation naturally takes the form of a multi-head (subspace) self-attention. By unrolling such iterative denoising operations into a deep network, we arrive at a highly compact architecture that consists of *only* self-attention operators with skip connections at each layer. Moreover, we show that each layer performs highly efficient denoising: it improves the signal-to-noise ratio of token representations *at a linear rate* with respect to the number of layers. Despite its simplicity, extensive experiments on vision and language tasks demonstrate that such a transformer achieves performance close to that of standard transformer architectures such as GPT-2 and CRATE.
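The layer structure the abstract describes — multi-head subspace self-attention plus a skip connection, with no FFN and no LayerNorm — can be sketched in a few lines. This is a minimal illustration, not the paper's exact operator: the head bases `U_heads`, the step size `eta`, and the softmax-similarity form of the attention are assumptions chosen for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def subspace_attention_layer(Z, U_heads, eta=0.5):
    """One unrolled denoising step: multi-head subspace self-attention
    plus a skip connection (no FFN, no LayerNorm).

    Z       : (n, d) array of token representations (rows are tokens)
    U_heads : list of (d, p) orthonormal bases, one low-dim subspace per head
    eta     : step size of the unrolled iteration (hypothetical choice)
    """
    update = np.zeros_like(Z)
    for U in U_heads:
        P = Z @ U                       # project tokens onto this head's subspace: (n, p)
        A = softmax(P @ P.T, axis=-1)   # token-token similarity within the subspace
        update += A @ P @ U.T           # averaged (denoised) estimate, mapped back to R^d
    return Z + eta * update             # skip connection

# toy usage: 8 tokens in R^16, two heads with 4-dim subspaces
rng = np.random.default_rng(0)
U_heads = [np.linalg.qr(rng.standard_normal((16, 4)))[0] for _ in range(2)]
Z = rng.standard_normal((8, 16))
Z_next = subspace_attention_layer(Z, U_heads)
```

Stacking many such layers yields the attention-only network the paper derives; each application is one step of the iterative compression toward the union of subspaces.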
Problem

Research questions and friction points this paper is trying to address.

Designing interpretable transformer architectures with necessary components
Reducing redundancy in transformer components via denoising
Achieving efficient representation learning with attention-only layers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unrolled subspace denoising for transformers
Attention-only architecture with skip connections
Linear rate signal-to-noise ratio improvement