mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing hyper-connection methods rely on Sinkhorn–Knopp iterations to approximate doubly stochastic matrices, and suffer from insufficient accuracy, implementation complexity, and training instability. This work proposes mHC-lite, the first approach that explicitly constructs exact doubly stochastic residual matrices as convex combinations of permutation matrices, grounded in the Birkhoff–von Neumann theorem. By doing so, it eliminates the accumulation of approximation error across network depth. The method requires only native matrix operations—no custom CUDA kernels or iterative normalization—which dramatically simplifies implementation. Experimental results demonstrate that mHC-lite matches or exceeds the performance of mHC while substantially improving training throughput and eliminating residual instability.
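To see the approximation gap the summary refers to, here is a minimal, illustrative sketch of Sinkhorn–Knopp normalization (not the mHC implementation): alternating row and column normalization of a positive matrix. After any finite number of iterations, whichever axis was normalized last sums exactly to 1, while the other axis is only approximately normalized.

```python
import numpy as np

def sinkhorn_knopp(M, n_iters):
    """Alternately normalize rows and columns of a positive matrix.

    With finitely many iterations the result is only approximately
    doubly stochastic: the last step fixes the column sums, but the
    row sums are generally left with a small residual error.
    """
    M = M.copy()
    for _ in range(n_iters):
        M /= M.sum(axis=1, keepdims=True)  # row normalization
        M /= M.sum(axis=0, keepdims=True)  # column normalization
    return M

rng = np.random.default_rng(0)
A = rng.random((4, 4)) + 0.1              # strictly positive matrix
S = sinkhorn_knopp(A, n_iters=20)
col_err = np.abs(S.sum(axis=0) - 1).max()  # ~0: columns normalized last
row_err = np.abs(S.sum(axis=1) - 1).max()  # small but not guaranteed zero
```

In mHC this residual is projected onto the Birkhoff polytope only approximately, and the paper argues the leftover error can accumulate through depth.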

📝 Abstract
Hyper-Connections (HC) generalizes residual connections by introducing dynamic residual matrices that mix information across multiple residual streams, accelerating convergence in deep neural networks. However, unconstrained residual matrices can compromise training stability. To address this, DeepSeek's Manifold-Constrained Hyper-Connections (mHC) approximately projects these matrices onto the Birkhoff polytope via iterative Sinkhorn–Knopp (SK) normalization. We identify two limitations of this approach: (i) finite SK iterations do not guarantee exact doubly stochasticity, leaving an approximation gap that can accumulate through network depth and undermine stability; (ii) efficient SK implementation requires highly specialized CUDA kernels, raising engineering barriers and reducing portability. Motivated by the Birkhoff–von Neumann theorem, we propose mHC-lite, a simple reparameterization that explicitly constructs doubly stochastic matrices as convex combinations of permutation matrices. This approach guarantees exact doubly stochasticity by construction and can be implemented using only native matrix operations. Extensive experiments demonstrate that mHC-lite matches or exceeds mHC in performance while achieving higher training throughput with a naive implementation and eliminating the residual instabilities observed in both HC and mHC. The code is publicly available at https://github.com/FFTYYY/mhc-lite.
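The construction the abstract describes can be sketched in a few lines. This is an assumed minimal illustration, not the paper's actual parameterization (the linked repository has the real code): softmax weights over a fixed stack of permutation matrices yield a convex combination that is exactly doubly stochastic by the Birkhoff–von Neumann theorem, using only native matrix operations.

```python
import numpy as np

def doubly_stochastic(logits, perms):
    """Convex combination of permutation matrices (Birkhoff-von Neumann).

    logits: (k,) unconstrained parameters; perms: (k, n, n) stack of
    0/1 permutation matrices. Softmax weights are nonnegative and sum
    to 1, and every permutation matrix has unit row and column sums,
    so the combination is exactly doubly stochastic by construction.
    """
    w = np.exp(logits - logits.max())
    w /= w.sum()                          # softmax -> convex weights
    return np.einsum('k,kij->ij', w, perms)

n = 4
rng = np.random.default_rng(0)
# a few fixed permutations of {0..n-1}, including the identity
perms = np.stack([np.eye(n)[list(p)] for p in
                  [(0, 1, 2, 3), (1, 2, 3, 0), (3, 2, 1, 0)]])
M = doubly_stochastic(rng.standard_normal(3), perms)
# every row and column of M sums to 1 up to float rounding
```

No iterative normalization or custom kernels are involved; the only learned parameters in this sketch are the `logits`.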
Problem

Research questions and friction points this paper is trying to address.

Hyper-Connections
doubly stochastic matrices
Sinkhorn-Knopp normalization
training stability
Birkhoff polytope
Innovation

Methods, ideas, or system contributions that make the work stand out.

doubly stochastic matrix
Birkhoff–von Neumann theorem
hyper-connections
Sinkhorn-Knopp normalization
reparameterization
Yongyi Yang
University of Michigan
Machine learning, Graph neural networks
Jianyang Gao
CCDS, Nanyang Technological University, Singapore