🤖 AI Summary
Hyper-Connections (HC) mix features across multiple residual streams, but unconstrained mixing disrupts the identity mapping of residual connections and destabilizes training. Existing remedies constrain the residual matrices to the Birkhoff polytope, which leads to identity degeneration, limited expressivity, and poor parameter efficiency. This work proposes Spectral-Sphere-Constrained Hyper-Connections (sHC), which relocate the feasible set of residual matrices onto a spectral norm sphere. The spectral-sphere constraint removes the non-negativity limitation of prior approaches, so residual matrices can contain negative entries that support feature subtraction and disentanglement. The method needs neither Sinkhorn iterations nor factorially scaling permutation-based parameterizations, which yields both parameter efficiency and stable training. As a result, sHC increases expressivity and cross-stream interaction flexibility and supports selective feature diversification.
📝 Abstract
Hyper-Connections (HC) generalize residual connections into multiple streams, employing residual matrices for cross-stream feature mixing to enrich model expressivity. However, unconstrained mixing disrupts the identity mapping property intrinsic to the residual connection, causing unstable training. To address this, Manifold-Constrained Hyper-Connections (mHC) and its variant restrict these matrices to the Birkhoff polytope (doubly stochastic matrices) via Sinkhorn iterations or permutation-based parameterizations. We reveal three limitations of this polytope constraint: (1) identity degeneration, where learned matrices collapse around the identity and diminish cross-stream interactions, (2) an expressivity bottleneck, as the non-negativity constraint prevents subtractive feature disentanglement, and (3) parameterization inefficiencies, manifesting as unstable Sinkhorn iterations or the factorial-scaling overhead of permutation-based parameterizations. To overcome these flaws, we propose Spectral-Sphere-Constrained Hyper-Connections (sHC). By geometrically shifting the feasible set from a rigid polytope to a spectral norm sphere, sHC allows negative entries, unlocking subtractive interactions for selective feature diversification. This shift eliminates unstable Sinkhorn projections and factorial parameterization, enabling expressive, non-degenerate residual matrices while preserving training stability.
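The geometric contrast in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how sHC parameterizes matrices on the spectral norm sphere, so the code assumes the simplest option, dividing a raw matrix by its largest singular value to reach unit spectral norm (the radius 1 is also an assumption). The function names `sinkhorn`, `spectral_norm`, and `project_to_spectral_sphere` are hypothetical, and the example matrix is made up.

```python
import math


def sinkhorn(logits, iters=100):
    """Push a square matrix toward the Birkhoff polytope (doubly stochastic).

    Exponentiation makes every entry positive; alternating row and column
    normalization then drives all row and column sums toward 1.
    """
    n = len(logits)
    m = [[math.exp(v) for v in row] for row in logits]
    for _ in range(iters):
        m = [[v / sum(row) for v in row] for row in m]
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def spectral_norm(a, iters=200):
    """Estimate the largest singular value by power iteration on A^T A."""
    rows, cols = len(a), len(a[0])
    v = [float(j + 1) for j in range(cols)]  # arbitrary non-uniform start
    for _ in range(iters):
        av = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(a[i][j] * av[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("start vector lies in the null space of the matrix")
        v = [x / norm for x in w]
    av = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in av))


def project_to_spectral_sphere(a):
    """Assumed parameterization: rescale so the spectral norm equals 1."""
    sigma = spectral_norm(a)
    return [[x / sigma for x in row] for row in a]


# Made-up 2-stream residual matrix with one negative entry.
raw = [[0.5, -1.2], [0.8, 0.3]]

doubly_stochastic = sinkhorn(raw)            # every entry ends up positive
on_sphere = project_to_spectral_sphere(raw)  # the negative entry survives

print(doubly_stochastic)
print(on_sphere)
```

Under these assumptions, the Sinkhorn result can only form non-negative mixtures of the streams, while the rescaled matrix keeps its negative entry and can therefore subtract one stream from another, which is the kind of subtractive interaction the abstract attributes to sHC.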