🤖 AI Summary
Training multimodal large language models (MLLMs) faces challenges including deep coupling of heterogeneous architectures, rigid parallelization logic, poor system scalability, and high engineering overhead. To address these, we propose a model-centric distributed training framework that decouples model definition from communication and parallelization logic, enabling a plug-and-play three-dimensional parallelism strategy library for low-overhead, highly scalable training of arbitrary multimodal models. The framework features modular design, native support for Mixture-of-Experts (MoE) architectures, and flexible configuration interfaces, significantly reducing integration complexity for new modalities. In experiments on a 128-GPU cluster, the framework achieves per-GPU throughput exceeding 2,800 tokens/s and natively supports context lengths up to 160K, substantially improving training efficiency and scalability for large-scale multimodal LLMs.
📝 Abstract
Recent advances in large language models (LLMs) have driven impressive progress in omni-modal understanding and generation. However, training omni-modal LLMs remains a significant challenge due to the heterogeneous model architectures required to process diverse modalities, necessitating sophisticated system design for efficient large-scale training. Existing frameworks typically entangle model definition with parallel logic, incurring limited scalability and substantial engineering overhead for end-to-end omni-modal training.
We present VeOmni, a modular and efficient training framework that accelerates the development of omni-modal LLMs. VeOmni introduces model-centric distributed recipes that decouple communication from computation, enabling efficient 3D parallelism for omni-modal LLMs. VeOmni also features a flexible configuration interface supporting seamless integration of new modalities with minimal code changes.
Using VeOmni, an omni-modal mixture-of-experts (MoE) model with 30B parameters can be trained at over 2,800 tokens/sec/GPU throughput and scaled to 160K context lengths via 3D parallelism on 128 GPUs, showcasing its superior efficiency and scalability for training large omni-modal LLMs.
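The core idea above is that model code stays free of parallelism concerns, while a separate configuration describes the 3D (data/tensor/context) parallel layout. The following is a minimal, entirely hypothetical sketch of that separation (these class and function names are illustrative assumptions, not the actual VeOmni API):

```python
# Hypothetical sketch of decoupling model definition from the parallel plan.
# None of these names come from VeOmni; they only illustrate the concept.
from dataclasses import dataclass

@dataclass
class ParallelPlan:
    """A standalone description of a 3D parallel layout."""
    data_parallel: int = 1
    tensor_parallel: int = 1
    context_parallel: int = 1  # e.g. sequence sharding for long contexts

    def world_size(self) -> int:
        # GPUs consumed by one full replica of the 3D layout.
        return self.data_parallel * self.tensor_parallel * self.context_parallel

def shard_layout(num_gpus: int, plan: ParallelPlan) -> dict:
    """Validate that a plan fits the cluster; model code never touches this."""
    if num_gpus % plan.world_size() != 0:
        raise ValueError("parallel plan does not divide the cluster evenly")
    return {
        "replicas": num_gpus // plan.world_size(),
        "dp": plan.data_parallel,
        "tp": plan.tensor_parallel,
        "cp": plan.context_parallel,
    }

# A layout at the 128-GPU scale reported in the abstract.
layout = shard_layout(128, ParallelPlan(data_parallel=8,
                                        tensor_parallel=8,
                                        context_parallel=2))
print(layout)  # {'replicas': 1, 'dp': 8, 'tp': 8, 'cp': 2}
```

Because the plan is just data, swapping in a different modality encoder or changing the parallel degrees requires editing only the configuration, not the model definition, which is the kind of low-overhead integration the abstract describes.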