Symmetry Breaking in Transformers for Efficient and Interpretable Training

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the redundant rotational degrees of freedom inherent in the standard Transformer attention mechanism, which, while not affecting model outputs, complicate optimization and hinder interpretability. The authors propose the first application of symmetry breaking to Transformers by introducing fixed, non-learnable query and value biases generated via batch sampling, thereby embedding a preferred direction into the attention computation that breaks the rotational symmetry. This minimal architectural modification substantially improves training efficiency with memory-efficient optimizers such as SGD with momentum (SGDM), narrowing the performance gap with adaptive optimizers. The approach achieves strong results on downstream logical reasoning tasks and validation loss, while simultaneously enhancing model interpretability by selectively amplifying semantically meaningful structure within individual attention heads.

📝 Abstract
The attention mechanism in its standard implementation contains extraneous rotational degrees of freedom that are carried through computation but do not affect model activations or outputs. We introduce a simple symmetry-breaking protocol that inserts a preferred direction into this rotational space through batchwise-sampled, unlearned query and value biases. This modification has two theoretically motivated and empirically validated consequences. First, it can substantially improve the performance of simple, memory-efficient optimizers, narrowing -- and in some cases closing -- the gap to successful but more complex memory-intensive adaptive methods. We demonstrate this by pretraining 124M parameter transformer models with four optimization algorithms (AdamW, SOAP, SGDM, and Energy Conserving Descent (ECD)) and evaluating both validation loss and downstream logical reasoning. Second, it enables an interpretable use of otherwise redundant rotational degrees of freedom, selectively amplifying semantically meaningful token classes within individual attention heads. Overall, our results show that minimal, principled architectural changes can simultaneously improve performance and interpretability.
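The fixed query and value biases described in the abstract can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' code: the single-head setup, the Gaussian sampling of the frozen biases (standing in for the paper's batchwise sampling), and the function name `attention_with_fixed_biases` are all assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len = 16, 5

# Frozen, non-learnable biases: sampled once and never updated during
# training. They pin a preferred direction in the query/value spaces,
# breaking the rotational symmetry of the standard attention head.
# (Gaussian sampling here is an assumption; the paper samples batchwise.)
q_bias = rng.normal(size=d_model)
v_bias = rng.normal(size=d_model)

def attention_with_fixed_biases(x, Wq, Wk, Wv):
    """Single-head scaled dot-product attention where the query and
    value projections receive fixed additive biases."""
    q = x @ Wq + q_bias          # preferred direction in query space
    k = x @ Wk
    v = x @ Wv + v_bias          # preferred direction in value space
    scores = q @ k.T / np.sqrt(d_model)
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = attention_with_fixed_biases(x, Wq, Wk, Wv)
print(out.shape)  # (5, 16)
```

Because the biases carry no gradients, this modification adds no trainable parameters and no optimizer state, which is consistent with the paper's focus on memory-efficient optimizers such as SGDM.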
Problem

Research questions and friction points this paper is trying to address.

symmetry breaking
attention mechanism
rotational degrees of freedom
training efficiency
interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

symmetry breaking
attention mechanism
memory-efficient optimization
interpretable transformers
rotational degrees of freedom