Balancing Computation Load and Representation Expressivity in Parallel Hybrid Neural Networks

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the dual challenges of computational load imbalance and representational degradation in parallel hybrid neural networks, this paper proposes FlowHN: a novel architecture featuring a FLOP-aware dynamic token routing mechanism for adaptive load balancing, coupled with a learnable heterogeneous branch fusion module integrating Transformer self-attention and state-space models to jointly preserve representation fidelity and expressive capacity. Its key innovations include gradient-coordinated output fusion and MFU-driven load scheduling, breaking performance bottlenecks inherent in conventional parallel hybrid models. Evaluated on language models ranging from 135M to 1B parameters, FlowHN achieves up to a 4× throughput improvement and a 2× MFU gain over baselines, while significantly outperforming both serial architectures and existing parallel hybrid approaches in accuracy.

📝 Abstract
Attention and State-Space Models (SSMs), when combined in a hybrid network either sequentially or in parallel, provide complementary strengths. A sequential hybrid pipeline alternates between applying a transformer to the input and feeding its output into an SSM, which leaves individual components idle, increasing end-to-end latency and lowering throughput caps. In a parallel hybrid architecture, the transformer operates independently and in parallel with the SSM, and these pairs are cascaded, with the output of one pair forming the input to the next. Two issues arise: (i) creating an expressive knowledge representation from the inherently divergent outputs of these separate branches, and (ii) balancing the computation load between the parallel branches while maintaining representation fidelity. In this work we present FlowHN, a novel parallel hybrid network architecture that accommodates various load-balancing strategies through appropriate distribution of input tokens between the two branches. Two innovative differentiating factors in FlowHN are a FLOP-aware dynamic token split between the attention and SSM branches, yielding an efficient balance in compute load, and a method to fuse the highly divergent outputs of the individual branches to enhance representation expressivity. Together they enable much higher token processing speeds, avoid bottlenecks, and at the same time yield significantly improved accuracy compared to competing works. We conduct comprehensive experiments on autoregressive language modeling with models of 135M, 350M, and 1B parameters. FlowHN outperforms sequential hybrid models and its parallel counterpart, achieving up to 4× higher Tokens per Second (TPS) and 2× better Model FLOPs Utilization (MFU).
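The core idea of a FLOP-aware token split can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the FLOP cost models, the state dimension, and the brute-force search for the split point are all simplifying assumptions made here for clarity. Because attention cost grows quadratically with token count while an SSM scan grows linearly, equalizing FLOPs routes most tokens to the SSM branch.

```python
# Illustrative sketch of a FLOP-aware token split between an attention branch
# and an SSM branch. The cost models below are rough, assumed approximations.

def attn_flops(n: int, d: int) -> int:
    # Rough self-attention cost: QKV/output projections ~ n*d^2, score matmuls ~ n^2*d.
    return 4 * n * d * d + 2 * n * n * d

def ssm_flops(n: int, d: int, state: int = 16) -> int:
    # Rough linear-time SSM scan cost: proportional to n * d * state size.
    return 2 * n * d * state

def flop_aware_split(n_tokens: int, d: int, state: int = 16) -> tuple[int, int]:
    """Choose how many tokens go to the SSM branch so both branches
    perform approximately equal FLOPs (brute-force over split points)."""
    best_k, best_gap = 0, float("inf")
    for k in range(n_tokens + 1):  # k tokens to the SSM, the rest to attention
        gap = abs(ssm_flops(k, d, state) - attn_flops(n_tokens - k, d))
        if gap < best_gap:
            best_k, best_gap = k, gap
    return best_k, n_tokens - best_k

# With 1024 tokens and hidden size 512, the quadratic attention cost means
# only a small share of tokens keeps the attention branch busy as long as the SSM.
ssm_n, attn_n = flop_aware_split(1024, 512)
```

In FlowHN the split is applied dynamically per block; this sketch only shows the static balancing principle behind it.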
Problem

Research questions and friction points this paper is trying to address.

Balancing computation load in parallel hybrid neural networks
Enhancing representation expressivity from divergent parallel branches
Improving token processing speed and accuracy simultaneously
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel hybrid network with dynamic token split
FLOP aware load balancing between branches
Fusion method for divergent branch outputs
Mohammad Mahdi Moradi
Department of Computer Science, Concordia University, Ascend Team, Huawei Technologies
Walid Ahmed
Huawei Technologies Canada
Deep Learning, Machine Learning, Soft Computing
Shuangyue Wen
Ascend Team, Toronto Research Center, Huawei Technologies
Sudhir Mudur
Professor of Computer Science, Concordia University
Computer Graphics
Weiwei Zhang
Ascend Team, Toronto Research Center, Huawei Technologies
Yang Liu
Ascend Team, Toronto Research Center, Huawei Technologies