🤖 AI Summary
Transformer architectures are largely designed empirically, offer limited interpretability, and carry structural redundancy, notably in components such as feed-forward networks (FFNs) and LayerNorm. Method: Grounded in subspace denoising theory, this work derives a purely attention-based, interpretable transformer: representation learning is formalized as iterative compression and denoising of noisy tokens toward a mixture of low-dimensional subspaces, naturally yielding a minimal architecture comprising only self-attention and skip connections. A multi-head subspace self-attention mechanism is introduced, unifying mathematical interpretability with structural minimality. Contribution/Results: Each layer is proven to improve the signal-to-noise ratio of token representations at a linear rate in the number of layers. Empirically, the model approaches GPT-2 and CRATE performance on vision and language benchmarks, indicating that FFNs and LayerNorm are dispensable. This establishes a new paradigm for principled, lightweight, and efficient large-model design.
📝 Abstract
Despite the popularity of transformers in practice, their architectures are empirically designed and thus neither mathematically justified nor interpretable. Moreover, as indicated by many empirical studies, some components of transformer architectures may be redundant. To derive a fully interpretable transformer architecture with only the necessary components, we contend that the goal of representation learning is to compress a set of noisy initial token representations towards a mixture of low-dimensional subspaces. To compress these noisy token representations, the associated denoising operation naturally takes the form of a multi-head (subspace) self-attention. By unrolling such iterative denoising operations into a deep network, we arrive at a highly compact architecture that consists of *only* self-attention operators with skip connections at each layer. Moreover, we show that each layer performs highly efficient denoising: it improves the signal-to-noise ratio of token representations *at a linear rate* with respect to the number of layers. Despite its simplicity, extensive experiments on vision and language tasks demonstrate that such a transformer achieves performance close to that of standard transformer architectures such as GPT-2 and CRATE.
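The unrolled layer the abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's exact operator: it assumes (hypothetically) one orthonormal basis `U` per head, a CRATE-style subspace attention in which tokens are projected onto each head's subspace, attention weights are computed among the projections, and the aggregated result is added back through the skip connection. The function names, the step size `step`, and the softmax normalization are illustrative assumptions.

```python
import numpy as np

def softmax(A, axis=0):
    """Column-stochastic softmax (each column of attention weights sums to 1)."""
    A = A - A.max(axis=axis, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=axis, keepdims=True)

def subspace_attention_layer(Z, U_list, step=0.5):
    """One attention-only layer with a skip connection.

    Z      : (d, n) matrix of n token representations in R^d.
    U_list : list of (d, p) orthonormal bases, one low-dimensional
             subspace per attention head.
    Returns Z_next = Z + step * MSSA(Z), a sketch of the update
    Z_{l+1} = Z_l + denoising-step(Z_l) described in the abstract.
    """
    update = np.zeros_like(Z)
    for U in U_list:
        P = U.T @ Z                   # project tokens onto this head's subspace
        A = softmax(P.T @ P, axis=0)  # attention among projected tokens
        update += U @ (P @ A)         # lift the denoised projections back to R^d
    return Z + step * update          # skip connection

# Usage: 5 noisy 8-dimensional tokens, 3 heads with 2-dimensional subspaces.
rng = np.random.default_rng(0)
Z = rng.standard_normal((8, 5))
U_list = [np.linalg.qr(rng.standard_normal((8, 2)))[0] for _ in range(3)]
Z_next = subspace_attention_layer(Z, U_list)
```

Stacking this layer depth-many times gives the FFN-free, LayerNorm-free architecture the abstract refers to: each repetition is one step of the iterative denoising toward the mixture of subspaces.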