🤖 AI Summary
This work investigates whether Transformers can be stably trained without residual connections. We identify gradient degradation, caused by the absence of skip connections, as the core obstacle, and trace its origin through a Jacobian-conditioning analysis that reveals progressive ill-conditioning of layer-wise gradients during backpropagation. Based on this analysis, we propose a principled initialization scheme that mitigates gradient degradation without altering the network architecture. Our method enables, for the first time, stable end-to-end optimization of standard Vision Transformer (ViT) models in both supervised and self-supervised settings without any residual connections, demonstrating that residual connections are not a necessary condition for ViT trainability. Empirical evaluation shows that our residual-free ViTs outperform strong residual-based baselines on dense prediction tasks, indicating improved representational capacity and generalization. These results challenge conventional architectural assumptions and offer a new perspective on hierarchical representation learning.
📝 Abstract
Transformers have achieved remarkable success across a wide range of applications, a feat often attributed to their scalability. Yet training them without skip (residual) connections remains notoriously difficult. While skips stabilize optimization, they also disrupt the hierarchical structure of representations, raising the long-standing question of whether transformers can be trained efficiently without them. In this work, we address this problem by analyzing the Jacobian of a skipless transformer block, showing why skips improve conditioning and revealing that their stabilizing benefits can be recovered through a principled initialization strategy. Building on this insight, we introduce the first method that enables stable and efficient training of skipless transformers without altering the standard architecture. We validate our approach on Vision Transformers (ViTs) in both supervised and self-supervised settings, demonstrating that skipless ViTs trained with our initialization overcome the usual optimization barriers, learn richer hierarchical representations, and outperform strong baselines that incorporate skip connections on dense prediction benchmarks. These results show that skip connections are not a fundamental requirement for training ViTs and open new avenues for hierarchical representation learning in vision models.
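The Jacobian-conditioning argument can be made concrete with a toy sketch (this is an illustration of the general principle, not the paper's actual analysis or initialization scheme): a residual block backpropagates through I + F'(x) rather than F'(x) alone, and adding the identity keeps the product of per-layer Jacobians far better conditioned as depth grows. Using a diagonal 2x2 "layer Jacobian" keeps the singular values trivial to read off.

```python
# Toy illustration: condition number of the end-to-end backward Jacobian
# with vs. without a residual (skip) connection. For a diagonal matrix,
# the singular values are just the absolute diagonal entries, so the
# condition number is easy to compute by hand.

def cond_diag(d):
    """Condition number (sigma_max / sigma_min) of diag(d)."""
    s = sorted(abs(x) for x in d)
    return s[-1] / s[0]

# Hypothetical per-layer block Jacobian F'(x) = diag(0.1, 2.0).
layer = [0.1, 2.0]
depth = 8

# Skipless: the end-to-end Jacobian is the product of layer Jacobians.
skipless = [d ** depth for d in layer]
# With skips: each layer contributes I + F'(x) = diag(1.1, 3.0) instead.
residual = [(1.0 + d) ** depth for d in layer]

print(cond_diag(layer))      # 20.0 per layer
print(cond_diag(skipless))   # 20**8 -- conditioning degrades exponentially
print(cond_diag(residual))   # (3.0/1.1)**8 -- grows far more slowly
```

The gap between the last two numbers is the skipless problem in miniature: without the identity term, small singular directions shrink geometrically with depth, which is the "progressive ill-conditioning" the analysis above points to and the initialization is designed to counteract.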