ParaDySe: A Parallel-Strategy Switching Framework for Dynamic Sequence Lengths in Transformer

📅 2025-11-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Static parallelism strategies in large Transformer training struggle to adapt to dynamic sequence lengths: short sequences trigger communication-parallelization cancellation (CPC), while long sequences cause out-of-memory (OOM) errors. To address this, we propose a sequence-aware dynamic parallelism switching framework. Our method introduces a modular parallel library built upon a unified tensor layout, enabling fine-grained, hot-swappable inter-layer parallelism selection driven by sequence length. We further develop a lightweight hybrid memory and time cost model and integrate a heuristic algorithm for real-time optimal parallelism decisions. Evaluated on training with sequence lengths up to 624K, our framework eliminates both OOM and CPC bottlenecks, improving training stability and GPU resource utilization. Key contributions include: (1) the first sequence-length-driven dynamic parallelism switching mechanism; (2) a unified, modular parallel library supporting heterogeneous layer-wise strategies; and (3) an efficient hybrid cost model with real-time decision capability.
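To make the summary concrete, here is a minimal sketch of cost-model-guided per-layer strategy selection. All names (`Strategy`, `mem_cost`, `time_cost`, `pick_strategy`) and the cost coefficients are hypothetical illustrations, not ParaDySe's actual library or models; the sketch assumes a simple greedy rule of "fastest candidate that fits the memory budget".

```python
# Hypothetical sketch: pick a per-layer parallel strategy from sequence-aware
# memory and time cost models. Coefficients are made up for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class Strategy:
    name: str
    mem_per_token: float   # illustrative per-token activation memory (MB)
    comm_overhead: float   # illustrative fixed communication cost (ms)
    compute_factor: float  # illustrative per-token compute cost (ms)

CANDIDATES = [
    Strategy("tensor-parallel",   mem_per_token=0.5, comm_overhead=8.0,  compute_factor=0.010),
    Strategy("sequence-parallel", mem_per_token=0.2, comm_overhead=20.0, compute_factor=0.012),
    Strategy("no-parallel",       mem_per_token=1.0, comm_overhead=0.0,  compute_factor=0.015),
]

def mem_cost(s: Strategy, seq_len: int) -> float:
    return s.mem_per_token * seq_len

def time_cost(s: Strategy, seq_len: int) -> float:
    return s.comm_overhead + s.compute_factor * seq_len

def pick_strategy(seq_len: int, mem_budget_mb: float) -> Strategy:
    """Greedy heuristic: fastest strategy whose memory estimate fits the budget."""
    feasible = [s for s in CANDIDATES if mem_cost(s, seq_len) <= mem_budget_mb]
    if not feasible:
        raise MemoryError("no candidate strategy fits: training would OOM")
    return min(feasible, key=lambda s: time_cost(s, seq_len))

# Short sequences avoid communication-heavy strategies (the CPC case);
# long sequences are forced onto memory-lean ones (avoiding OOM).
print(pick_strategy(1_000, mem_budget_mb=4_000).name)      # → no-parallel
print(pick_strategy(600_000, mem_budget_mb=200_000).name)  # → sequence-parallel
```

Under these toy coefficients, a 1K-token sequence selects the communication-free strategy, while a 600K-token sequence is the only case where the memory-lean strategy is feasible, mirroring the CPC/OOM trade-off described above.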


๐Ÿ“ Abstract
Dynamic sequences with varying lengths have been widely used in the training of Transformer-based large language models (LLMs). However, current training frameworks adopt a pre-defined static parallel strategy for these sequences, causing either communication-parallelization cancellation (CPC) on short sequences or out-of-memory (OOM) errors on long sequences. To mitigate these issues, we propose ParaDySe, a novel adaptive Parallel strategy switching framework for Dynamic Sequences. ParaDySe enables on-the-fly adoption of the optimal strategy according to the immediate input sequence. It first implements modular function libraries for parallel strategies with unified tensor layout specifications, and then builds sequence-aware memory and time cost models with hybrid methods. Guided by these cost models, ParaDySe selects optimal layer-wise strategies for dynamic sequences via an efficient heuristic algorithm. By integrating these techniques, ParaDySe achieves seamless hot-switching of optimal strategies through its well-designed function libraries. We compare ParaDySe with baselines on representative LLMs under datasets with sequence lengths up to 624K. Experimental results indicate that ParaDySe addresses the OOM and CPC bottlenecks in LLM training by systematically integrating long-sequence optimizations with existing frameworks.
Problem

Research questions and friction points this paper is trying to address.

Optimizing parallel strategies for dynamic sequence lengths in Transformer training
Addressing memory and communication inefficiencies in large language models
Enabling adaptive strategy switching for varying input sequence lengths
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive parallel strategy switching for dynamic sequences
Hybrid cost models guiding optimal layer-wise strategy selection
Modular function libraries enabling seamless hot-switching capability
Zhixin Ou
College of Computer Science and Technology, National University of Defense Technology
Peng Liang
School of Computer Science, Wuhan University
Software Engineering · Software Architecture · Empirical Software Engineering
Jianchen Han
College of Computer Science and Technology, National University of Defense Technology
Baihui Liu
College of Computer Science and Technology, National University of Defense Technology
Linbo Qiao
NUDT
Stochastic Optimization · Distributed Optimization · Large-scale Machine Learning