🤖 AI Summary
Existing automatic parallelism planning frameworks for LLM training overlook the coupled impact of node heterogeneity and dynamic network topology, resulting in coarse-grained workload allocation and low search efficiency. This paper presents the first joint model of computational node heterogeneity and time-varying network topology, and proposes a simulation-driven automatic parallelism planning method. We design a heterogeneity-aware, fine-grained workload allocation mechanism and integrate strategy pruning to drastically reduce the parallel configuration search space. Experiments demonstrate that, in dynamic heterogeneous settings such as cloud environments, our approach improves training throughput by 18.7% and accelerates the search by 3.2×, while matching state-of-the-art methods in end-to-end performance under stable conditions. The method establishes a scalable, adaptive parallel optimization paradigm for large-scale distributed LLM training.
📝 Abstract
Hybrid parallelism techniques are essential for efficiently training large language models (LLMs). However, current automatic parallel planning frameworks rarely consider node heterogeneity and dynamic network topology changes jointly, which limits their effectiveness in practice. In this paper, we address these limitations by modeling heterogeneous nodes within dynamically changing network environments and leveraging simulation-based strategies to determine optimal parallel configurations. Our approach enables fine-grained workload allocation tailored to heterogeneous nodes and complex network scenarios, achieving performance competitive with state-of-the-art methods under regular, stable network conditions. Additionally, we introduce a strategy pruning technique that rapidly discards infeasible parallel configurations, substantially reducing the search space, and we further accelerate the search through parallel execution within the simulator. Preliminary evaluations confirm that our method notably improves training performance on heterogeneous nodes and adapts better to complex, dynamic scenarios such as cloud computing environments.
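The abstract's enumerate-prune-simulate workflow can be illustrated with a minimal sketch. This is not the paper's implementation: the config space (data/tensor/pipeline parallel degrees), the memory-based feasibility pruning rule, and the toy cost model standing in for the simulator are all illustrative assumptions.

```python
from itertools import product

def enumerate_configs(num_gpus, max_deg=8):
    """Enumerate (dp, tp, pp) parallel degrees whose product covers all GPUs."""
    for dp, tp, pp in product(range(1, max_deg + 1), repeat=3):
        if dp * tp * pp == num_gpus:
            yield (dp, tp, pp)

def is_feasible(cfg, model_params_b, gpu_mem_gb):
    """Strategy pruning: discard configs whose rough per-GPU memory estimate
    exceeds capacity (illustrative rule, not the paper's)."""
    dp, tp, pp = cfg
    # Coarse mixed-precision estimate: ~16 bytes/param for weights, grads,
    # and optimizer states, sharded across tensor- and pipeline-parallel groups.
    per_gpu_gb = 16 * model_params_b / (tp * pp)
    return per_gpu_gb <= gpu_mem_gb

def simulated_step_time(cfg, compute_speed=1.0, bandwidth=1.0):
    """Toy stand-in for the simulator: compute shrinks with total parallelism,
    communication grows with tp and dp, pipeline bubbles grow with pp."""
    dp, tp, pp = cfg
    compute = 1.0 / (compute_speed * dp * tp * pp)
    comm = 0.1 * (tp - 1) / bandwidth + 0.05 * (dp - 1) / bandwidth
    bubble = 0.02 * (pp - 1)
    return compute + comm + bubble

def plan(num_gpus, model_params_b, gpu_mem_gb):
    """Prune infeasible configs first, then pick the simulated-fastest one."""
    candidates = [c for c in enumerate_configs(num_gpus)
                  if is_feasible(c, model_params_b, gpu_mem_gb)]
    return min(candidates, key=simulated_step_time) if candidates else None
```

In the real system the candidates surviving pruning would be evaluated by the simulator in parallel, and the cost model would account for per-node compute speeds and the current network topology rather than the scalar `compute_speed` and `bandwidth` used here.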