M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance

📅 2025-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses performance imbalance in multimodal large language models (MLLMs) arising from severe inter-modal data disparity and heterogeneous convergence rates during training. To mitigate this, we propose a phased dynamic balancing strategy: (1) step-level modality sampling during pretraining to alleviate data quantity imbalances, and (2) adaptive loss weighting during instruction tuning to enforce synchronized convergence across modalities. Built upon a unified multimodal sequence modeling framework, our approach integrates a cross-modal alignment encoder and a modality-aware dynamic loss mechanism, coupled with a progressive instruction-tuning pipeline. The resulting open-source fully multimodal model, M2-omni, supports arbitrary combinations of audio, video, image, and text inputs, as well as interleaved multimodal outputs. On multimodal understanding and generation benchmarks, M2-omni matches GPT-4o’s performance while retaining strong robustness on pure-text tasks—establishing it as the state-of-the-art open-source fully multimodal LLM.
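The paper does not publish the exact formulation of its adaptive loss weighting, but the idea described above — up-weight the losses of modalities that are converging more slowly so all modalities progress in sync — can be sketched as follows. The class name, window size, and the choice of "relative loss decrease over a recent window" as the convergence measure are illustrative assumptions, not details from the paper.

```python
from collections import deque

class AdaptiveLossBalancer:
    """Illustrative sketch: per-modality loss weights set inversely
    proportional to each modality's recent convergence rate, so
    slower-converging modalities receive larger weights.
    (Hypothetical implementation; not the paper's exact method.)"""

    def __init__(self, modalities, window=100, eps=1e-8):
        self.histories = {m: deque(maxlen=window) for m in modalities}
        self.eps = eps

    def update(self, losses):
        """Record current per-modality losses; return per-modality weights
        normalized so the mean weight is 1."""
        for m, loss in losses.items():
            self.histories[m].append(loss)
        rates = {}
        for m, h in self.histories.items():
            if len(h) > 1:
                # Convergence rate ~ relative loss decrease over the window;
                # clamp at eps so a stalled modality gets a large weight.
                rates[m] = max(self.eps, (h[0] - h[-1]) / (h[0] + self.eps))
            else:
                rates[m] = 1.0
        inv = {m: 1.0 / r for m, r in rates.items()}
        total = sum(inv.values())
        return {m: len(inv) * v / total for m, v in inv.items()}
```

In this sketch, a modality whose loss plateaus sees its weight grow, pulling optimizer effort toward it — the synchronized-convergence behavior the summary attributes to the instruction-tuning stage.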

📝 Abstract
We present M2-omni, a cutting-edge, open-source omni-MLLM that achieves performance competitive with GPT-4o. M2-omni employs a unified multimodal sequence modeling framework, which empowers large language models (LLMs) to acquire comprehensive cross-modal understanding and generation capabilities. Specifically, M2-omni can process arbitrary combinations of audio, video, image, and text modalities as input, generating multimodal sequences interleaved with audio, image, or text outputs, thereby enabling an advanced, interactive real-time experience. Training such an omni-MLLM is challenged by significant disparities in data quantity and convergence rates across modalities. To address these challenges, we propose a step balance strategy during pre-training to handle the quantity disparities in modality-specific data. Additionally, a dynamically adaptive balance strategy is introduced during the instruction tuning stage to synchronize modality-wise training progress, ensuring optimal convergence. Notably, we prioritize preserving strong performance on pure-text tasks throughout training to maintain the robustness of M2-omni's language understanding capability. To the best of our knowledge, M2-omni is currently among the most competitive open-source alternatives to GPT-4o, characterized by its comprehensive modality and task support as well as its exceptional performance. We expect M2-omni to advance the development of omni-MLLMs and thus facilitate future research in this domain.
Problem

Research questions and friction points this paper is trying to address.

Enhancing multimodal understanding and generation
Balancing training across modalities with disparate data quantities and convergence rates
Maintaining robust performance on pure-text tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multimodal sequence modeling framework
Step balance strategy for pre-training
Dynamically adaptive balance strategy for instruction tuning
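The step balance strategy listed above — sampling which modality's data to use at each pre-training step according to target ratios rather than raw corpus sizes, so scarce modalities are not drowned out — can be sketched minimally. The function name and the specific target ratios are hypothetical, chosen only for illustration.

```python
import random

def sample_modality(rng, target_ratios):
    """Pick the modality whose batch to use for this training step,
    drawn with probability proportional to a target ratio rather than
    raw corpus size. (Illustrative sketch, not the paper's exact method.)"""
    modalities, weights = zip(*target_ratios.items())
    return rng.choices(modalities, weights=weights, k=1)[0]

# Hypothetical target ratios for demonstration.
rng = random.Random(0)
ratios = {"text": 0.4, "image": 0.3, "video": 0.2, "audio": 0.1}
counts = {m: 0 for m in ratios}
for _ in range(10000):
    counts[sample_modality(rng, ratios)] += 1
# Empirical frequencies now track the target ratios regardless of how
# much raw data each modality actually has.
```

Decoupling the per-step sampling distribution from corpus size is what lets a low-resource modality (e.g. audio) keep a fixed share of training steps.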
Qingpei Guo
Ant Group
Multimodal LLMs · Vision-Language Models
Kaiyou Song
Huazhong University of Science and Technology
Visual Perception · Self-Supervised Learning · MLLM · Embodied-AI
Zipeng Feng
Ant Group
Ziping Ma
Ant Group
Qinglong Zhang
Ant Group
Sirui Gao
Ant Group
Xuzheng Yu
Ant Group
Yunxiao Sun
Ant Group
Tai-Wei Chang
Ant Group
Jingdong Chen
Ant Group
Ming Yang
Ant Group
Jun Zhou
Ant Group