SlimPack: Fine-Grained Asymmetric Packing for Balanced and Efficient Variable-Length LLM Training

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
LLM distributed training suffers from severe load imbalance and low hardware utilization due to highly heterogeneous sequence lengths, which undermine conventional packing strategies. Method: We propose a fine-grained data slicing and heterogeneous scheduling framework: (i) sample-level dynamic slicing coupled with asynchronous forward/backward grouping, decoupling memory, communication, and computation constraints; (ii) a heterogeneous partitioning algorithm with a two-stage solver for runtime load balancing under multi-dimensional parallelism; and (iii) a high-fidelity simulator-driven co-scheduling mechanism. Contribution/Results: Experiments show up to 2.8× higher training throughput over state-of-the-art baselines, significantly improved GPU utilization, and balanced gains in both memory efficiency and communication efficiency.
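The slice-level decomposition described above can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the function names (`slice_samples`, `pack_slices`), the first-fit-decreasing packing heuristic, and the slice/capacity values are all assumptions made for the sketch.

```python
# Illustrative sketch (not SlimPack's actual code): decompose
# variable-length samples into bounded slices, then greedily pack the
# slices into fixed-capacity micro-batches so that no single long
# sample dominates memory or communication.

def slice_samples(lengths, max_slice):
    """Split each sample into (sample_id, offset, size) slices of at
    most `max_slice` tokens."""
    slices = []
    for sample_id, n in enumerate(lengths):
        offset = 0
        while offset < n:
            size = min(max_slice, n - offset)
            slices.append((sample_id, offset, size))
            offset += size
    return slices

def pack_slices(slices, capacity):
    """First-fit-decreasing packing of slices into micro-batches."""
    bins = []  # each entry: [remaining_capacity, list_of_slices]
    for s in sorted(slices, key=lambda s: s[2], reverse=True):
        for b in bins:
            if b[0] >= s[2]:
                b[1].append(s)
                b[0] -= s[2]
                break
        else:  # no existing bin fits: open a new one
            bins.append([capacity - s[2], [s]])
    return [b[1] for b in bins]

# Toy workload with highly heterogeneous lengths (hypothetical values).
lengths = [9000, 1200, 300, 4096]
slices = slice_samples(lengths, max_slice=2048)
batches = pack_slices(slices, capacity=4096)
```

After slicing, every unit is at most 2048 tokens, so packing operates on small, uniform pieces instead of whole volatile samples.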

📝 Abstract
The efficient distributed training of Large Language Models (LLMs) is severely hampered by the extreme variance in context lengths. This data heterogeneity, amplified by conventional packing strategies and asymmetric forward-backward costs, leads to critical inefficiencies such as cascading workload imbalances and severe hardware underutilization. Existing solutions attempt to mitigate these challenges, but often at the expense of memory or communication efficiency. To address these challenges, we introduce SlimPack, a framework that fundamentally rethinks data packing and scheduling by decomposing samples into fine-grained slices. This slice-level decomposition immediately mitigates critical memory and communication bottlenecks by transforming large, volatile workloads into a stream of smaller, manageable units. This flexibility is then harnessed for our core innovation, Asymmetric Partitioning, which assembles balanced scheduling units uniquely optimized for the different demands of the forward and backward passes. Orchestrated by a two-phase solver and a high-fidelity simulator, SlimPack holistically resolves imbalances across all parallel dimensions. Extensive experiments demonstrate that SlimPack achieves up to a $2.8\times$ training throughput improvement over baselines, breaking the conventional trade-off by delivering both superior balance and high resource efficiency.
Problem

Research questions and friction points this paper is trying to address.

Addresses workload imbalance in distributed LLM training from variable-length data
Optimizes packing strategy to resolve memory and communication bottlenecks
Improves hardware utilization by balancing forward-backward pass scheduling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained slice decomposition for workload management
Asymmetric partitioning for forward-backward pass optimization
Two-phase solver with simulator for parallel dimension balancing
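The asymmetric-partitioning idea above can be illustrated with a simple greedy heuristic: each slice carries separate forward and backward costs (backward passes are typically more expensive per token), and slices are placed so as to minimize the larger of the two resulting bottlenecks. This is a simplified stand-in for the paper's two-phase solver; the cost pairs, the worker-selection rule, and the function name are assumptions made for the sketch.

```python
# Hedged sketch of asymmetric partitioning (not the paper's solver):
# balance forward and backward loads as separate dimensions rather
# than a single combined cost.

def asymmetric_partition(slices, num_workers):
    """Assign (fwd_cost, bwd_cost) slices to workers, largest total
    first, placing each slice on the worker that minimizes the larger
    of its resulting forward and backward loads."""
    fwd = [0.0] * num_workers
    bwd = [0.0] * num_workers
    assignment = [[] for _ in range(num_workers)]
    for f, b in sorted(slices, key=lambda s: s[0] + s[1], reverse=True):
        w = min(range(num_workers), key=lambda i: max(fwd[i] + f, bwd[i] + b))
        assignment[w].append((f, b))
        fwd[w] += f
        bwd[w] += b
    return assignment

# Toy cost pairs (hypothetical) where forward and backward demands
# pull in opposite directions; two workers end perfectly balanced in
# both dimensions.
demo = asymmetric_partition([(4, 1), (1, 4), (3, 3), (2, 2)], num_workers=2)
```

Treating the two passes as independent balancing dimensions is what distinguishes this from conventional packing, which optimizes a single aggregate cost and can leave one pass badly skewed.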
Yuliang Liu
Kling Infra, Kuaishou Technology
Guohao Wu
Kling Infra, Kuaishou Technology
Shenglong Zhang
Kling Infra, Kuaishou Technology
Wei Zhang
Kling Infra, Kuaishou Technology
Qianchao Zhu
Peking University (High Performance Computing, Machine Learning Systems)
Zhouyang Li
Kling Infra, Kuaishou Technology
Chenyu Wang
Kling Infra, Kuaishou Technology