Model Parallelism With Subnetwork Data Parallelism

📅 2025-07-11

🤖 AI Summary

To address heavy per-node memory demands and significant intra-node communication costs in distributed pre-training of large models, this paper proposes Subnetwork Data Parallelism (SDP): each worker trains a small, structured subnetwork of the full model, which avoids the inter-node activation communication that pipelining requires. The paper evaluates two subnetwork construction strategies, stochastic block dropping and width-wise construction, both guided by the principle that every parameter should be represented uniformly across workers. Stochastic block dropping consistently outperforms the width-wise approach, which the authors attribute empirically to stronger gradient alignment in subnetworks that retain blocks with skip connections. Bandwidth requirements are comparable to or lower than those of standard all-reduce data parallelism. Preliminary experiments show a 20–40% reduction in memory usage without any loss in performance.

📝 Abstract

Distributed pre-training of large models at scale often imposes heavy memory demands on individual nodes and incurs significant intra-node communication costs. We propose a novel alternative approach that reduces the memory requirements by training small, structured subnetworks of the model on separate workers. Unlike pipelining, our method avoids inter-node activation communication and maintains bandwidth requirements that are comparable to or lower than standard data parallel communication schemes based on all-reduce. We evaluate two subnetwork construction strategies guided by the principle of ensuring uniform representation of each parameter across the distributed training setup. Our results show that the stochastic block dropping technique consistently outperforms the width-wise subnetwork construction previously explored in federated learning. We empirically attribute this superior performance to stronger gradient alignment in subnetworks that retain blocks having skip connections. Preliminary experiments highlight the promise of our approach, achieving a 20-40% reduction in memory usage without any loss in performance.
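The principle of uniform parameter representation can be illustrated with a small sketch. The abstract does not give the paper's actual sampling procedure, so everything here is an assumption: the function name `assign_block_masks`, the fixed number of blocks kept per worker, and the shuffled cyclic-window assignment are hypothetical choices that simply guarantee every block is retained by the same number of workers.

```python
import random


def assign_block_masks(num_blocks, num_workers, keep_per_worker, seed=0):
    """Return masks[w][b] == True if worker w keeps block b.

    Hypothetical scheme (not taken from the paper): each worker keeps
    exactly `keep_per_worker` blocks, and every block is kept by the same
    number of workers, i.e. num_workers * keep_per_worker / num_blocks.
    """
    if keep_per_worker > num_blocks:
        raise ValueError("a worker cannot keep more blocks than exist")
    if (num_workers * keep_per_worker) % num_blocks != 0:
        raise ValueError(
            "num_workers * keep_per_worker must be divisible by num_blocks"
        )

    # Randomize which blocks end up adjacent in the cyclic order.
    rng = random.Random(seed)
    order = list(range(num_blocks))
    rng.shuffle(order)

    masks = [[False] * num_blocks for _ in range(num_workers)]
    for w in range(num_workers):
        # Each worker takes the next window of `keep_per_worker` positions,
        # wrapping around, so coverage is spread evenly over all blocks.
        start = (w * keep_per_worker) % num_blocks
        for i in range(keep_per_worker):
            masks[w][order[(start + i) % num_blocks]] = True
    return masks


masks = assign_block_masks(num_blocks=8, num_workers=4, keep_per_worker=6)
coverage = [sum(masks[w][b] for w in range(4)) for b in range(8)]
```

With 8 blocks, 4 workers, and 6 blocks kept per worker, each worker drops a quarter of the blocks and `coverage` is the same for every block. How this maps to actual memory savings depends on details the abstract does not state.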

Problem

Research questions and friction points this paper is trying to address.

Heavy per-node memory demands in distributed pre-training of large models

Inter-node activation communication required by pipelining

Reducing memory usage without losing performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training subnetworks on separate workers

Avoiding inter-node activation communication

Reducing memory usage by 20-40%
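When workers hold different subnetworks, only parameter gradients need to be combined, not activations. The sketch below shows one plausible way to do this: average each block's gradient over the workers that kept that block. The aggregation rule, the function name `aggregate_block_gradients`, and the list-of-floats gradient format are assumptions for illustration; the summary above does not specify how the paper combines gradients.

```python
def aggregate_block_gradients(grads, masks):
    """Average each block's gradient over the workers that kept it.

    grads[w][b] is worker w's gradient for block b (a list of floats);
    it is ignored whenever masks[w][b] is False.
    Returns one averaged gradient per block, or None if no worker kept it.
    """
    num_workers = len(masks)
    num_blocks = len(masks[0])
    averaged = []
    for b in range(num_blocks):
        owners = [w for w in range(num_workers) if masks[w][b]]
        if not owners:
            averaged.append(None)  # no worker trained this block this step
            continue
        size = len(grads[owners[0]][b])
        averaged.append(
            [sum(grads[w][b][i] for w in owners) / len(owners) for i in range(size)]
        )
    return averaged


# Two workers, two blocks: worker 0 dropped block 1.
masks = [[True, False], [True, True]]
grads = [[[1.0, 3.0], None], [[3.0, 5.0], [2.0]]]
result = aggregate_block_gradients(grads, masks)
# result == [[2.0, 4.0], [2.0]]
```

Because each worker contributes gradients only for the blocks it holds, the per-worker gradient payload in this sketch is no larger than in ordinary data parallelism, which is consistent with the bandwidth claim in the abstract.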

🔎 Similar Papers

Enhancing Large-Scale AI Training Efficiency: The C4 Solution for Real-Time Anomaly Detection and Communication Optimization