🤖 AI Summary
Training multimodal large language models (MLLMs) faces challenges including deep coupling of heterogeneous architectures, rigid parallelization logic, poor system scalability, and high engineering overhead. To address these, we propose a model-centric distributed training framework that decouples model definition from communication and parallelization logic, enabling a plug-and-play three-dimensional parallelism strategy library for low-overhead, highly scalable training of arbitrary multimodal models. The framework features modular design, native support for Mixture-of-Experts (MoE) architectures, and flexible configuration interfaces, significantly reducing integration complexity for new modalities. In experiments on a 128-GPU cluster, the framework achieves per-GPU throughput exceeding 2,800 tokens/s and natively supports context lengths up to 160K, substantially improving training efficiency and scalability for large-scale multimodal LLMs.
📝 Abstract
Recent advances in large language models (LLMs) have driven impressive progress in omni-modal understanding and generation. However, training omni-modal LLMs remains a significant challenge due to the heterogeneous model architectures required to process diverse modalities, necessitating sophisticated system design for efficient large-scale training. Existing frameworks typically entangle model definition with parallel logic, incurring limited scalability and substantial engineering overhead for end-to-end omni-modal training.
We present VeOmni, a modular and efficient training framework that accelerates the development of omni-modal LLMs. VeOmni introduces model-centric distributed recipes that decouple communication from computation, enabling efficient 3D parallelism for omni-modal LLMs. VeOmni also features a flexible configuration interface supporting seamless integration of new modalities with minimal code changes.
Using VeOmni, an omni-modal mixture-of-experts (MoE) model with 30B parameters can be trained at over 2,800 tokens/sec/GPU throughput and scaled to 160K context lengths via 3D parallelism on 128 GPUs, showcasing its superior efficiency and scalability for training large omni-modal LLMs.
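The core idea above is that model code stays free of parallelism concerns, while a separate configuration describes the 3D (data/tensor/context) parallel layout. The following is a minimal, entirely hypothetical sketch of that separation (these class and function names are illustrative assumptions, not the actual VeOmni API):

```python
# Hypothetical sketch of decoupling model definition from the parallel plan.
# None of these names come from VeOmni; they only illustrate the concept.
from dataclasses import dataclass

@dataclass
class ParallelPlan:
    """A standalone description of a 3D parallel layout."""
    data_parallel: int = 1
    tensor_parallel: int = 1
    context_parallel: int = 1  # e.g. sequence sharding for long contexts

    def world_size(self) -> int:
        # GPUs consumed by one full replica of the 3D layout.
        return self.data_parallel * self.tensor_parallel * self.context_parallel

def shard_layout(num_gpus: int, plan: ParallelPlan) -> dict:
    """Validate that a plan fits the cluster; model code never touches this."""
    if num_gpus % plan.world_size() != 0:
        raise ValueError("parallel plan does not divide the cluster evenly")
    return {
        "replicas": num_gpus // plan.world_size(),
        "dp": plan.data_parallel,
        "tp": plan.tensor_parallel,
        "cp": plan.context_parallel,
    }

# A layout at the 128-GPU scale reported in the abstract.
layout = shard_layout(128, ParallelPlan(data_parallel=8,
                                        tensor_parallel=8,
                                        context_parallel=2))
print(layout)  # {'replicas': 1, 'dp': 8, 'tp': 8, 'cp': 2}
```

Because the plan is just data, swapping in a different modality encoder or changing the parallel degrees requires editing only the configuration, not the model definition, which is the kind of low-overhead integration the abstract describes.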