ParaDySe: A Parallel-Strategy Switching Framework for Dynamic Sequence Lengths in Transformer

📅 2025-11-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Static parallelism strategies in large Transformer training struggle to adapt to dynamic sequence lengths: short sequences trigger communication-parallelization cancellation (CPC), while long sequences cause out-of-memory (OOM) errors. To address this, we propose a sequence-aware dynamic parallelism switching framework. Our method introduces a modular parallel library built upon a unified tensor layout, enabling fine-grained, hot-swappable inter-layer parallelism selection driven by sequence length. We further develop a lightweight hybrid memory and time cost model and integrate a heuristic algorithm for real-time optimal parallelism decisions. Evaluated on training with sequence lengths up to 624K, our framework eliminates both OOM and CPC bottlenecks, improving training stability and GPU resource utilization. Key contributions include: (1) the first sequence-length-driven dynamic parallelism switching mechanism; (2) a unified, modular parallel library supporting heterogeneous layer-wise strategies; and (3) an efficient hybrid cost model with real-time decision capability.
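To make the summary concrete, here is a minimal sketch of cost-model-guided per-layer strategy selection. All names (`Strategy`, `mem_cost`, `time_cost`, `pick_strategy`) and the cost coefficients are hypothetical illustrations, not ParaDySe's actual library or models; the sketch assumes a simple greedy rule of "fastest candidate that fits the memory budget".

```python
# Hypothetical sketch: pick a per-layer parallel strategy from sequence-aware
# memory and time cost models. Coefficients are made up for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class Strategy:
    name: str
    mem_per_token: float   # illustrative per-token activation memory (MB)
    comm_overhead: float   # illustrative fixed communication cost (ms)
    compute_factor: float  # illustrative per-token compute cost (ms)

CANDIDATES = [
    Strategy("tensor-parallel",   mem_per_token=0.5, comm_overhead=8.0,  compute_factor=0.010),
    Strategy("sequence-parallel", mem_per_token=0.2, comm_overhead=20.0, compute_factor=0.012),
    Strategy("no-parallel",       mem_per_token=1.0, comm_overhead=0.0,  compute_factor=0.015),
]

def mem_cost(s: Strategy, seq_len: int) -> float:
    return s.mem_per_token * seq_len

def time_cost(s: Strategy, seq_len: int) -> float:
    return s.comm_overhead + s.compute_factor * seq_len

def pick_strategy(seq_len: int, mem_budget_mb: float) -> Strategy:
    """Greedy heuristic: fastest strategy whose memory estimate fits the budget."""
    feasible = [s for s in CANDIDATES if mem_cost(s, seq_len) <= mem_budget_mb]
    if not feasible:
        raise MemoryError("no candidate strategy fits: training would OOM")
    return min(feasible, key=lambda s: time_cost(s, seq_len))

# Short sequences avoid communication-heavy strategies (the CPC case);
# long sequences are forced onto memory-lean ones (avoiding OOM).
print(pick_strategy(1_000, mem_budget_mb=4_000).name)      # → no-parallel
print(pick_strategy(600_000, mem_budget_mb=200_000).name)  # → sequence-parallel
```

Under these toy coefficients, a 1K-token sequence selects the communication-free strategy, while a 600K-token sequence is the only case where the memory-lean strategy is feasible, mirroring the CPC/OOM trade-off described above.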


๐Ÿ“ Abstract
Dynamic sequences with varying lengths have been widely used in the training of Transformer-based large language models (LLMs). However, current training frameworks adopt a pre-defined static parallel strategy for these sequences, causing either communication-parallelization cancellation (CPC) on short sequences or out-of-memory (OOM) errors on long sequences. To mitigate these issues, we propose ParaDySe, a novel adaptive Parallel strategy switching framework for Dynamic Sequences. ParaDySe enables on-the-fly adoption of the optimal strategy according to the immediate input sequence. It first implements modular function libraries for parallel strategies with unified tensor layout specifications, and then builds sequence-aware memory and time cost models with hybrid methods. Guided by these cost models, ParaDySe selects optimal layer-wise strategies for dynamic sequences via an efficient heuristic algorithm. By integrating these techniques, ParaDySe achieves seamless hot-switching of optimal strategies through its well-designed function libraries. We compare ParaDySe with baselines on representative LLMs under datasets with sequence lengths up to 624K. Experimental results indicate that ParaDySe addresses the OOM and CPC bottlenecks in LLM training by systematically integrating long-sequence optimizations with existing frameworks.
Problem

Research questions and friction points this paper is trying to address.

Optimizing parallel strategies for dynamic sequence lengths in Transformer training
Addressing memory and communication inefficiencies in large language models
Enabling adaptive strategy switching for varying input sequence lengths
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive parallel strategy switching for dynamic sequences
Hybrid cost models guiding optimal layer-wise strategy selection
Modular function libraries enabling seamless hot-switching capability
Zhixin Ou
College of Computer Science and Technology, National University of Defense Technology
Peng Liang
School of Computer Science, Wuhan University
Software Engineering · Software Architecture · Empirical Software Engineering
Jianchen Han
College of Computer Science and Technology, National University of Defense Technology
Baihui Liu
College of Computer Science and Technology, National University of Defense Technology
Linbo Qiao
NUDT
Stochastic Optimization · Distributed Optimization · Large-scale Machine Learning