Revisiting Residual Connections: Orthogonal Updates for Stable and Efficient Deep Networks

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Standard residual updates add a module's output directly onto the input stream, so updates tend to reinforce or rescale existing feature directions, which can lead to redundant feature learning and limited representational diversity. To address this, the paper proposes the Orthogonal Residual Update: the module's output is decomposed relative to the input stream, and only the component orthogonal to the stream is added back, encouraging successive modules to contribute complementary representational directions. The update is a drop-in modification that integrates with both ResNetV2 and Vision Transformer architectures. On ImageNet-1k, it improves ViT-B top-1 accuracy by 4.3 percentage points, and it also improves generalization and training stability on CIFAR-10/100 and TinyImageNet. These results support the value of orthogonal priors for representation learning.

📝 Abstract
Residual connections are pivotal for deep neural networks, enabling greater depth by mitigating vanishing gradients. However, in standard residual updates, the module's output is directly added to the input stream. This can lead to updates that predominantly reinforce or modulate the existing stream direction, potentially underutilizing the module's capacity for learning entirely novel features. In this work, we introduce Orthogonal Residual Update: we decompose the module's output relative to the input stream and add only the component orthogonal to this stream. This design aims to guide modules to contribute primarily new representational directions, fostering richer feature learning while promoting more efficient training. We demonstrate that our orthogonal update strategy improves generalization accuracy and training stability across diverse architectures (ResNetV2, Vision Transformers) and datasets (CIFARs, TinyImageNet, ImageNet-1k), achieving, for instance, a +4.3%p top-1 accuracy gain for ViT-B on ImageNet-1k.
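A minimal sketch of the update described in the abstract, in NumPy. The function name, the per-feature-vector projection, and the epsilon for numerical stability are my own choices, not the paper's exact formulation:

```python
import numpy as np

def orthogonal_residual_update(x, f_x, eps=1e-6):
    """Residual update that keeps only the component of f_x
    orthogonal to the input stream x.

    x   : input stream, shape (..., d)
    f_x : module output for x, same shape
    """
    # Projection of f_x onto the direction of x: (<f_x, x> / <x, x>) * x
    dot = np.sum(f_x * x, axis=-1, keepdims=True)
    norm_sq = np.sum(x * x, axis=-1, keepdims=True) + eps
    parallel = (dot / norm_sq) * x
    # Discard the parallel part; add only the orthogonal remainder,
    # in place of the standard update x + f_x.
    return x + (f_x - parallel)
```

For example, with `x = [1, 0]` and `f_x = [3, 4]`, the parallel component `[3, 0]` is removed and only `[0, 4]` is added, so the update is orthogonal to the stream. In a Transformer this would replace each `x + Attn(x)` / `x + MLP(x)` step, applied per token.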
Problem

Research questions and friction points this paper is trying to address.

Mitigating vanishing gradients in deep networks
Enhancing novel feature learning in residual connections
Improving generalization accuracy and training stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Orthogonal decomposition of residual updates
Adds only orthogonal component to input
Improves generalization and training stability