HyGra: Accelerating Network-State Simulation for LLM Training in DCNs via Adaptive Packet-Flow Granularity

📅 2026-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing network simulators struggle to simultaneously achieve efficiency and fidelity in large language model (LLM) training: packet-level simulation offers high accuracy but incurs substantial overhead, whereas flow-level simulation is efficient yet introduces significant distortion. This work proposes a hybrid-granularity network state simulator that, for the first time, incorporates an adaptive granularity-switching mechanism—employing packet-level simulation during non-steady-state phases and dynamically transitioning to flow-level simulation during steady-state periods to balance accuracy and speed. The approach requires no specialized hardware, runs on a single machine, and integrates seamlessly with existing simulators. Experiments on real-world LLM workloads, including ChatGPT, DeepSeek, and Qwen, demonstrate speedups of up to 15.4× under single parallelism strategies and 7.8× under hybrid parallelism, all while preserving high simulation fidelity.

📝 Abstract
In recent years, large language models (LLMs) have driven substantial intelligent transformation across diverse industries. Commercial LLM training is typically performed over data center networks (DCNs) comprising hundreds to thousands of GPUs, with multiple devices collocated per node. As network scale expands, inter-node communication becomes a primary bottleneck to training efficiency. Network-state simulators therefore play a crucial role by enabling cost-effective evaluation of network configurations and parallelization strategies through faithful emulation of DCN dynamics during LLM training. However, existing simulators are constrained by an efficiency-fidelity tradeoff, as packet-level simulators (PLSs) incur prohibitive runtime overhead, whereas flow-level simulators (FLSs) compromise essential modeling accuracy. In this paper, we develop HyGra, a hybrid-granularity network-state simulator that exploits intrinsic network dynamics in LLM training to adaptively switch simulation granularity. Specifically, HyGra employs packet-level simulation during non-steady phases with transient fluctuations and flow-level simulation during steady phases with periodic patterns, thereby accelerating execution while preserving high fidelity. Moreover, it requires no specialized hardware, supports single-machine deployment, and is compatible with existing simulators. Experiments based on representative commercial LLM workloads, including ChatGPT, DeepSeek, and Qwen, show that HyGra achieves up to 15.4× speedup under a single parallelization strategy and 7.8× under hybrid parallelization strategies while maintaining high accuracy.
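The core idea of the abstract, switching between packet-level and flow-level simulation based on whether traffic is in a steady phase, can be sketched as follows. This is an illustrative approximation only: the paper's actual phase-detection mechanism is not described here, and the function name, variance-based detector, and threshold are all assumptions.

```python
# Hypothetical sketch of HyGra-style adaptive granularity switching.
# The detector below (coefficient of variation over a sample window)
# and the 0.05 threshold are illustrative, not the paper's method.
from statistics import pvariance


def choose_granularity(rate_window, threshold=0.05):
    """Pick a simulation granularity from recent link-utilization samples.

    High fluctuation -> non-steady (transient) phase -> packet-level.
    Low fluctuation  -> steady periodic phase        -> flow-level.
    """
    if len(rate_window) < 2:
        return "packet"  # too few samples: stay on the high-fidelity path
    mean = sum(rate_window) / len(rate_window)
    if mean == 0:
        return "flow"  # idle link: flow-level is trivially accurate
    # Coefficient of variation normalizes fluctuation by the load level.
    cv = pvariance(rate_window) ** 0.5 / mean
    return "packet" if cv > threshold else "flow"


# Transient ramp-up vs. steady periodic collective traffic:
print(choose_granularity([0.1, 0.4, 0.9, 0.3]))      # -> packet
print(choose_granularity([0.80, 0.81, 0.80, 0.79]))  # -> flow
```

In this sketch the window would be refreshed each simulation epoch, so a burst of transient traffic (e.g. the start of an all-reduce) pushes the simulator back to packet-level until utilization settles.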
Problem

Research questions and friction points this paper is trying to address.

network-state simulation
LLM training
data center networks
simulation granularity
efficiency-fidelity tradeoff
Innovation

Methods, ideas, or system contributions that make the work stand out.

hybrid-granularity simulation
adaptive packet-flow granularity
network-state simulation
LLM training acceleration
data center networks
Wenyi Wang
University of Chicago
Parallel Computing
Zheng Wu
College of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
Yanmeng Wang
College of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
Haolin Mao
College of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
Lei Han
College of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
Gaogang Xie
Computer Network Information Center, Chinese Academy of Sciences
Fu Xiao
College of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210023, China