🤖 AI Summary
To address memory constraints and poor scalability in long-sequence post-training of large language models (LLMs), this paper proposes a plug-and-play sequence parallelism (SP) mechanism seamlessly integrated into the LLaMA-Factory framework. Methodologically, we design a lightweight plugin architecture that natively supports multi-mode sequence partitioning—namely, split, interleave, and reduce—while implementing efficient gradient synchronization and dynamic communication scheduling via PyTorch. The approach is fully compatible with the Hugging Face ecosystem and requires no model architecture modifications. Experimental results demonstrate substantial GPU memory reduction, enabling long-sequence post-training for models including Light-R1, TinyR1, and the Kaggle AIMO mathematical reasoning model. The solution has been adopted as a core component in proprietary training frameworks by multiple industry-leading enterprises.
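To make the partitioning modes above concrete, here is a minimal, hypothetical sketch of two of them ("split" and "interleave") on a toy token list. The function names, the two-chunk zigzag interleaving, and all shapes are illustrative assumptions, not the actual 360-LLaMA-Factory API.

```python
# Hypothetical sketch of two sequence-partitioning modes; in real sequence
# parallelism each rank would hold its shard as a tensor on its own GPU.

def split_partition(tokens, world_size, rank):
    """Contiguous split: rank r gets the r-th contiguous chunk."""
    chunk = len(tokens) // world_size
    return tokens[rank * chunk:(rank + 1) * chunk]

def interleave_partition(tokens, world_size, rank):
    """Interleaved (zigzag-style) split: rank r gets chunks r and
    (2*world_size - 1 - r), balancing causal-attention work per rank."""
    n = 2 * world_size
    chunk = len(tokens) // n
    first = tokens[rank * chunk:(rank + 1) * chunk]
    second = tokens[(n - 1 - rank) * chunk:(n - rank) * chunk]
    return first + second

seq = list(range(8))  # a toy "sequence" of 8 token ids
print(split_partition(seq, 2, 0))       # rank 0 of 2 -> [0, 1, 2, 3]
print(interleave_partition(seq, 2, 0))  # rank 0 of 2 -> [0, 1, 6, 7]
```

Under the interleaved layout, each rank owns one early and one late chunk of the sequence, which is why zigzag-style schemes are often preferred for causal attention: no rank is stuck with only the long-context tail.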
📝 Abstract
Adding sequence parallelism to LLaMA-Factory, we open-sourced 360-LLaMA-Factory at https://github.com/Qihoo360/360-LLaMA-Factory. 360-LLaMA-Factory has received wide recognition and has been used in models such as Light-R1 (arXiv:2503.10460), TinyR1 (arXiv:2503.04872), and Kaggle AIMO math models, as well as in large companies' training frameworks. This technical report delves deeper into the different sequence-parallel modes behind 360-LLaMA-Factory and discusses our implementation insights.