🤖 AI Summary
Existing hyper-connections address the intrinsic trade-off between gradient vanishing and representation collapse in deep neural networks, but incur substantial GPU memory overhead by expanding hidden-state dimensionality. This work proposes Frac-Connections: a lightweight multi-depth connectivity architecture that, for the first time, realizes the multi-scale strength principle of hyper-connections via *fractional blocking*: partitioning fixed-width hidden states into blocks and assigning each block a learnable, depth-dependent connection weight. Crucially, this design modulates gradient flow and information pathways across depths without expanding the hidden-state width or its memory footprint. Evaluated on a 7B MoE model trained on 3T tokens, Frac-Connections significantly outperform standard residual connections, accelerating convergence and improving final performance on both language modeling and downstream tasks.
📄 Abstract
Residual connections are central to modern deep learning architectures, enabling the training of very deep networks by mitigating gradient vanishing. Hyper-Connections recently generalized residual connections by introducing multiple connection strengths at different depths, thereby addressing the seesaw effect between gradient vanishing and representation collapse. However, Hyper-Connections increase memory access costs by expanding the width of hidden states. In this paper, we propose Frac-Connections, a novel approach that divides hidden states into multiple parts rather than expanding their width. Frac-Connections retain partial benefits of Hyper-Connections while reducing memory consumption. To validate their effectiveness, we conduct large-scale experiments on language tasks, with the largest being a 7B MoE model trained on up to 3T tokens, demonstrating that Frac-Connections significantly outperform residual connections.
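To make the contrast concrete, the sketch below shows one Frac-Connection-style update in plain Python. It is an illustrative simplification, not the paper's exact formulation: the function name `frac_connection_step`, the per-fraction weight vectors `alpha` (input strengths) and `beta` (residual/output strengths), and the flat-list representation of the hidden state are all assumptions. The key property it demonstrates is that the hidden state of width `d` is split into `n` fractions of width `d/n`, each with its own learnable connection strength, while the total width never grows (unlike Hyper-Connections, which expand the hidden state `n`-fold).

```python
def frac_connection_step(h, layer_fn, n, alpha, beta):
    """One illustrative Frac-Connection update (names and shapes assumed).

    h        : hidden state, a flat list of length d
    layer_fn : the sub-layer (e.g. attention or FFN), mapping width d -> d
    n        : number of fractions the hidden state is divided into
    alpha    : n per-fraction input weights (mix fractions into the layer input)
    beta     : n per-fraction output weights (per-block residual strength)
    """
    d = len(h)
    assert d % n == 0, "hidden width must divide evenly into n fractions"
    m = d // n
    # Split the fixed-width hidden state into n fractions -- no expansion.
    parts = [h[i * m:(i + 1) * m] for i in range(n)]
    # Each fraction enters the sub-layer scaled by its own learnable weight;
    # total input width stays d.
    layer_in = [alpha[i] * x for i in range(n) for x in parts[i]]
    out = layer_fn(layer_in)
    # Each fraction gets its own residual strength on the way back.
    return [parts[i][j] + beta[i] * out[i * m + j]
            for i in range(n) for j in range(m)]
```

With `alpha` and `beta` all equal to 1, the update reduces exactly to a standard residual connection `h + layer_fn(h)`; learning per-fraction, per-depth weights is what lets the model trade off gradient flow against representation collapse block by block.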