🤖 AI Summary
Hyper-Connections (HC) can improve model performance, but they break the identity mapping inherent in residual architectures, which leads to unstable training, poor scalability, and excessive memory consumption. This work proposes a trainable linear mixer that operates across parallel residual streams and is constrained to a manifold of operators with bounded norm, so that gradient conditioning is explicitly controlled and training remains stable and efficient. Key innovations include a prediction of the Jacobian spectrum grounded in free probability theory, a memory-efficient projection algorithm based on implicit differentiation, and an orthogonal mixer on the Stiefel manifold constructed via the Cayley transform. Evaluated on the ARC-AGI benchmark, the method outperforms doubly stochastic (bistochastic) baselines, with faster convergence, higher accuracy, and lower computational cost.
📝 Abstract
Recent advances in deep learning, exemplified by Hyper-Connections (HC), have expanded the residual connection paradigm by introducing wider residual streams and diverse connectivity patterns. While these innovations yield significant performance gains, they compromise the identity mapping property of residual connections, leading to training instability, limited scalability, and increased memory overhead. To address these challenges, we propose JPmHC (Jacobian-spectrum Preserving manifold-constrained Hyper-Connections), a framework that replaces identity skips with a trainable linear mixer acting on n parallel streams while explicitly controlling gradient conditioning. By constraining the mixer M to operator-norm-bounded manifolds (e.g., bistochastic, Stiefel, Grassmann), JPmHC prevents gradient pathologies and enhances stability. JPmHC makes three key contributions: (i) a free-probability analysis that predicts Jacobian spectra for structured skips, providing actionable design rules for mixer selection; (ii) memory-efficient implicit differentiation for fixed-point projections, reducing activation memory and synchronization overhead; and (iii) a Stiefel-constrained mixer via Cayley transforms, ensuring orthogonality without post-hoc normalization. Empirical evaluations on ARC-AGI demonstrate that JPmHC achieves faster convergence, higher accuracy, and lower computational cost than bistochastic baselines. As a flexible and scalable extension of HC, JPmHC advances spectrum-aware, stable, and efficient deep learning, offering insights into topological architecture design and foundational model evolution.
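To make contribution (iii) concrete, the following is a minimal sketch of how a Cayley transform can turn an unconstrained square parameter matrix into an orthogonal mixer over n streams. It is an illustration under stated assumptions, not the paper's implementation: the function names (`cayley_mixer`, `solve`, `matmul`), the parameterization A = W - Wᵀ, the square n × n case, and the example values are all hypothetical, and plain Python lists are used only to keep the sketch dependency-free.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def frobenius(m):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in m for v in row))


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def cayley_mixer(w):
    """Map an unconstrained square matrix w to an orthogonal mixer M.

    Assumed parameterization (not taken from the paper): A = w - w^T is
    skew-symmetric, so I + A is invertible and M = (I + A)^{-1} (I - A)
    is orthogonal by construction, with no post-hoc normalization.
    """
    n = len(w)
    a = [[w[i][j] - w[j][i] for j in range(n)] for i in range(n)]
    eye = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    i_plus = [[eye[i][j] + a[i][j] for j in range(n)] for i in range(n)]
    i_minus = [[eye[i][j] - a[i][j] for j in range(n)] for i in range(n)]
    return solve(i_plus, i_minus)


# Hypothetical example: n = 4 streams, each with 3 features.
w = [
    [0.0, 0.5, -0.2, 0.1],
    [0.3, 0.0, 0.4, -0.6],
    [0.1, -0.7, 0.0, 0.2],
    [0.9, 0.2, -0.3, 0.0],
]
streams = [
    [1.0, 2.0, 3.0],
    [-1.0, 0.5, 0.0],
    [0.2, 0.2, 0.2],
    [4.0, -2.0, 1.0],
]

m = cayley_mixer(w)
mixed = matmul(m, streams)  # mix across streams; feature dimension is untouched

print("norm before:", frobenius(streams))
print("norm after: ", frobenius(mixed))
```

Two properties of this sketch relate to the abstract's claims. Because M is orthogonal, all of its singular values equal 1, so mixing neither amplifies nor attenuates the stacked streams (the two printed norms agree up to floating-point error). And when the parameter matrix is all zeros, A = 0 and M reduces to the identity, which recovers a plain identity skip.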