Efficient Long Context Fine-tuning with Chunk Flow

📅 2025-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address two key challenges in long-context fine-tuning—data’s heavy-tailed distribution (dominated by short sequences with few long ones) and load imbalance in distributed training (data-parallel workload skew and severe pipeline-parallel bubbles)—this paper proposes ChunkFlow, a chunk-centric training paradigm. ChunkFlow uniformly partitions and recombines input sequences into fixed-size chunks and introduces a state-aware dynamic chunk scheduling mechanism, ensuring peak GPU memory consumption depends solely on chunk size—not maximum sequence length. The design natively integrates with mainstream pipeline-parallel schedulers (e.g., Megatron-LM). Experiments demonstrate that ChunkFlow accelerates long-context fine-tuning by up to 4.53× over Megatron-LM, significantly improves GPU utilization, and supports diverse scenarios including continual pretraining.

📝 Abstract
Long context fine-tuning of large language models (LLMs) involves training on datasets that are predominantly composed of short sequences and a small proportion of longer sequences. However, existing approaches overlook this long-tail distribution and employ training strategies designed specifically for long sequences. Moreover, these approaches also fail to address the challenges posed by variable sequence lengths during distributed training, such as load imbalance in data parallelism and severe pipeline bubbles in pipeline parallelism. These issues lead to suboptimal training performance and poor GPU resource utilization. To tackle these problems, we propose a chunk-centric training method named ChunkFlow. ChunkFlow reorganizes input sequences into uniformly sized chunks by consolidating short sequences and splitting longer ones. This approach achieves optimal computational efficiency and balance among training inputs. Additionally, ChunkFlow incorporates a state-aware chunk scheduling mechanism to ensure that the peak memory usage during training is primarily determined by the chunk size rather than the maximum sequence length in the dataset. Integrating this scheduling mechanism with existing pipeline scheduling algorithms further enhances the performance of distributed training. Experimental results demonstrate that, compared with Megatron-LM, ChunkFlow can be up to 4.53x faster in the long context fine-tuning of LLMs. Furthermore, we believe that ChunkFlow serves as an effective solution for a broader range of scenarios, such as long context continual pre-training, where datasets contain variable-length sequences.
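The chunk reorganization the abstract describes can be sketched as a simple greedy packer. This is my own illustration with made-up names, not the paper's implementation: short sequences are packed together and long ones are split at chunk boundaries, so every chunk (except possibly the last) carries the same token budget.

```python
def reorganize(sequences, chunk_size):
    """Pack variable-length sequences into fixed-size chunks.

    sequences: list of sequence lengths (in tokens).
    Returns a list of chunks; each chunk is a list of
    (seq_id, start, end) slices whose lengths sum to chunk_size
    (the final chunk may be smaller).
    """
    chunks, current, used = [], [], 0
    for seq_id, length in enumerate(sequences):
        offset = 0
        while offset < length:
            # Take as many tokens as fit in the current chunk;
            # this both consolidates short sequences and splits long ones.
            take = min(chunk_size - used, length - offset)
            current.append((seq_id, offset, offset + take))
            used += take
            offset += take
            if used == chunk_size:
                chunks.append(current)
                current, used = [], 0
    if current:
        chunks.append(current)
    return chunks
```

For example, with sequence lengths [3, 10, 2] and a chunk size of 4, the 10-token sequence is split across several chunks while the short sequences are packed alongside its fragments, yielding four uniformly sized chunks.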
Problem

Research questions and friction points this paper is trying to address.

Addresses inefficiency in long context fine-tuning of LLMs.
Solves load imbalance and pipeline bubbles in distributed training.
Improves GPU resource utilization with chunk-centric training.
Innovation

Methods, ideas, or system contributions that make the work stand out.

ChunkFlow reorganizes sequences into uniform chunks.
State-aware scheduling optimizes memory usage.
Enhances distributed training efficiency significantly.
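One way to see why state-aware scheduling keeps peak memory tied to chunk size is a toy counter (my own construction, not the paper's scheduler): when the chunks of a split long sequence run back-to-back, the activation state carried between chunks is live for at most one sequence at a time, whereas interleaving two long sequences keeps two states resident simultaneously.

```python
def peak_live_states(order, last_chunk_of):
    """Count the peak number of carried inter-chunk states.

    order: one sequence id per scheduled chunk (a split long sequence
    contributes several consecutive or interleaved entries).
    last_chunk_of: maps each sequence id to the index in `order` of
    its final chunk, at which point its carried state is freed.
    """
    live, peak = set(), 0
    for i, seq in enumerate(order):
        if last_chunk_of[seq] == i:
            # Final chunk of this sequence: its carried state is released.
            live.discard(seq)
        else:
            # Non-final chunk: state must stay resident for the next chunk.
            live.add(seq)
        peak = max(peak, len(live))
    return peak
```

Interleaving two split sequences ([0, 1, 0, 1]) keeps two states alive at once, while scheduling each sequence's chunks consecutively ([0, 0, 1, 1]) never holds more than one, which is the memory-bounding property the bullets above describe.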
Xiulong Yuan
Alibaba Cloud
Hongtao Xu
Fudan University
Professor
Wenting Shen
Qingdao University
cloud computing, data integrity auditing
Ang Wang
Alibaba
Xiafei Qiu
Alibaba Group
Jie Zhang
Alibaba Group
Yuqiong Liu
Alibaba Group
Bowen Yu
Qwen Team, Alibaba Group
Post-training, Foundation Model
Junyang Lin
Qwen Team, Alibaba Group & Peking University
Natural Language Processing, Cross-Modal Representation Learning, Pretraining
Mingzhen Li
State Key Lab of Processors, Institute of Computing Technology, CAS
Weile Jia
State Key Lab of Processors, Institute of Computing Technology, CAS
Yong Li
Alibaba Group
Wei Lin
Alibaba Group