🤖 AI Summary
When large language models are trained on long sequences, the memory needed to store activations during backpropagation grows rapidly with sequence length, and gradient checkpointing alone is not enough for very long sequences. This paper proposes StreamBP, an exact and memory-efficient backpropagation method that decomposes the chain rule along the sequence dimension in a layer-wise manner, reducing the memory cost of activation values and logits. The method exploits the causal structure of the language model to reduce FLOPs and speed up the backward pass, and it supports common training objectives, including supervised fine-tuning (SFT), GRPO, and DPO. Experiments show that StreamBP increases the maximum trainable sequence length by 2.8–5.5× over gradient checkpointing, with comparable or lower backward-pass time. A communication-efficient distributed variant supports multi-GPU training. The implementation is open source and is designed to integrate into existing transformer training pipelines.
📝 Abstract
Training language models on long-sequence data is a demanding requirement for enhancing the model's capability on complex tasks, e.g., long-chain reasoning. However, as the sequence length scales up, the memory cost of storing activation values during backpropagation (BP) becomes huge, even when gradient checkpointing is applied. To tackle this challenge, we propose a memory-efficient and exact BP method called StreamBP, which performs a linear decomposition of the chain rule along the sequence dimension in a layer-wise manner, significantly reducing the memory cost of activation values and logits. The proposed method is applicable to common objectives such as SFT, GRPO, and DPO. From an implementation perspective, StreamBP requires fewer FLOPs and achieves faster BP by leveraging the causal structure of the language model. Compared to gradient checkpointing, StreamBP increases the maximum sequence length of BP by a factor of 2.8-5.5, while using comparable or even less BP time. Note that StreamBP's sequence length scaling ability can be directly transferred to batch size scaling for accelerating training. We further develop a communication-efficient distributed StreamBP to effectively support multi-GPU training and broaden its applicability. Our code can be easily integrated into the training pipeline of any transformer model and is available at https://github.com/Ledzy/StreamBP.
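To make the idea of decomposing the chain rule along the sequence dimension more concrete, the sketch below applies it to the output (logits) layer only. It is an illustration written in NumPy, not the StreamBP implementation from the repository: the function names, the `chunk_size` parameter, and the use of a summed cross-entropy loss are assumptions made for this example. Because the loss is a sum over positions, its gradient is also a sum over position chunks, so the full sequence-by-vocabulary logits matrix never has to be held in memory at once, and the result is exact rather than approximate.

```python
import numpy as np


def lm_head_loss_full(hidden, weight, targets):
    """Reference path: materialize all (T, V) logits at once.

    hidden: (T, D) final hidden states, weight: (V, D) output projection,
    targets: (T,) integer token ids. Returns the summed cross-entropy loss,
    its gradient w.r.t. hidden (T, D), and its gradient w.r.t. weight (V, D).
    """
    num_pos = hidden.shape[0]
    logits = hidden @ weight.T                       # (T, V): the memory-heavy tensor
    logits -= logits.max(axis=1, keepdims=True)      # stabilize the softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(num_pos), targets]).sum()
    dlogits = probs.copy()                           # d loss / d logits = softmax - one_hot
    dlogits[np.arange(num_pos), targets] -= 1.0
    return loss, dlogits @ weight, dlogits.T @ hidden


def lm_head_loss_streamed(hidden, weight, targets, chunk_size):
    """Same loss and gradients, computed one sequence chunk at a time.

    Only a (chunk_size, V) block of logits exists at any moment.
    """
    loss = 0.0
    grad_hidden = np.zeros_like(hidden)
    grad_weight = np.zeros_like(weight)
    for start in range(0, hidden.shape[0], chunk_size):
        block = slice(start, start + chunk_size)
        block_loss, block_grad_hidden, block_grad_weight = lm_head_loss_full(
            hidden[block], weight, targets[block]
        )
        loss += block_loss
        grad_hidden[block] = block_grad_hidden       # each position's gradient is independent
        grad_weight += block_grad_weight             # weight gradient accumulates across chunks
    return loss, grad_hidden, grad_weight
```

In this simplified setting, peak memory for logits scales with `chunk_size` instead of the full sequence length. The abstract describes StreamBP as applying a decomposition of this kind layer by layer, where the causal structure of the model determines which earlier positions each chunk depends on; that part is not shown here.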