🤖 AI Summary
Standard residual networks struggle to model non-monotonic, complex dynamics because of the strictly additive inductive bias inherent in their architecture. This work proposes a deep Delta learning framework that introduces a learnable Delta operator, unifying the identity map, orthogonal projection, and geometric reflection through a rank-1 perturbation of the identity. By modulating the identity shortcut connection with data-dependent geometric transformations, the method enables fine-grained control over inter-layer state transitions. Furthermore, a gating mechanism acts as a dynamic step size that jointly regulates information erasure and writing. This approach significantly enhances the network's capacity to model complex dynamical systems while preserving training stability.
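The three geometric regimes named above follow directly from the operator's action on the $\mathbf{k}$-direction. A minimal numeric sketch (toy dimensions and random vectors, not the paper's code) checks each case:

```python
import numpy as np

# The Delta operator A = I - beta * k k^T for a unit vector k,
# evaluated at the three special gate values beta = 0, 1, 2.
d = 4
rng = np.random.default_rng(0)
k = rng.standard_normal(d)
k = k / np.linalg.norm(k)        # unit direction k
x = rng.standard_normal(d)

def delta_op(beta):
    return np.eye(d) - beta * np.outer(k, k)

# beta = 0: identity -- x passes through unchanged.
assert np.allclose(delta_op(0.0) @ x, x)
# beta = 1: orthogonal projection -- the k-component of x is erased.
assert np.allclose(k @ (delta_op(1.0) @ x), 0.0)
# beta = 2: Householder reflection -- the k-component flips sign.
assert np.allclose(k @ (delta_op(2.0) @ x), -(k @ x))
```

For every `beta`, directions orthogonal to `k` are untouched; only the single data-dependent direction is rescaled by `1 - beta`, which is exactly the "fine-grained control" the summary describes.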
📄 Abstract
The effectiveness of deep residual networks hinges on the identity shortcut connection. While this mechanism alleviates the vanishing-gradient problem, it also imposes a strictly additive inductive bias on feature transformations, limiting the network's ability to model complex hidden-state transitions. In this paper, we introduce \textbf{Deep Delta Learning (DDL)}, which generalizes the shortcut from a fixed identity map to a learnable, state-dependent linear operator. The resulting Delta Operator is a rank-1 perturbation of the identity, $\mathbf{A}(\mathbf{X}) = \mathbf{I} - \beta(\mathbf{X})\,\mathbf{k}(\mathbf{X})\,\mathbf{k}(\mathbf{X})^\top$, parameterized by a unit direction $\mathbf{k}(\mathbf{X})$ and a scalar gate $\beta(\mathbf{X})$. We provide a spectral analysis showing that $\beta(\mathbf{X})$ continuously interpolates the shortcut between identity ($\beta=0$), orthogonal projection ($\beta=1$), and Householder reflection ($\beta=2$). Furthermore, we rewrite the residual update as a synchronized rank-1 delta write: $\beta$ scales both the removal of the current $\mathbf{k}$-component and the injection of the new $\mathbf{k}$-component. This unification enables explicit control of the shortcut spectrum along a data-dependent direction while retaining stable training behavior. Empirically, replacing Transformer residual additions with DDL improves validation loss and perplexity, as well as downstream evaluation accuracy on language modeling tasks, with larger gains in the expanded-state setting.
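The synchronized rank-1 delta write can be sketched as a single update step. The sketch below is illustrative only: the maps producing $\mathbf{k}(\mathbf{X})$, the residual branch, and the gate (here `W_k`, `W_f`, and a fixed scalar `beta`) are toy stand-ins, not the paper's parameterization:

```python
import numpy as np

# One DDL-style update: erase the current k-component of x and write a new
# k-component taken from the residual branch, both scaled by the gate beta.
d = 4
rng = np.random.default_rng(1)
W_k = rng.standard_normal((d, d))   # toy map producing the direction k(x)
W_f = rng.standard_normal((d, d))   # toy residual branch

def ddl_update(x, beta):
    f = np.tanh(W_f @ x)            # residual-branch output (stand-in)
    k = W_k @ x
    k = k / np.linalg.norm(k)       # unit direction k(x)
    v_old = k @ x                   # current k-component (to be erased)
    v_new = k @ f                   # new k-component (to be injected)
    # Synchronized write: beta scales removal and injection together.
    return x + beta * k * (v_new - v_old)

x = rng.standard_normal(d)
y = ddl_update(x, beta=0.7)         # in DDL, beta(x) is a learned gate
```

Along $\mathbf{k}$ the output component is the convex-like blend $(1-\beta)\,v_{\text{old}} + \beta\,v_{\text{new}}$, while all orthogonal directions pass through unchanged; at $\beta = 0$ the layer reduces to the plain identity shortcut.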