Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data

📅 2025-11-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the limitations of existing open-source multimodal large language models (MLLMs) in cross-modal collaboration efficiency and generation controllability, this paper introduces Uni-MoE-2.0-Omni, a language-centric open-source omnimodal large model (OLM). Methodologically: (1) a dynamic-capacity Mixture-of-Experts (MoE) architecture incorporating shared, routed, and null experts balances computational load against capability; (2) an Omni-Modality 3D RoPE positional encoding jointly models spatiotemporal structure across text, images, and speech; (3) progressive supervised fine-tuning is combined with an iterative reinforcement learning scheme, GSPO followed by DPO, to jointly optimize reasoning and generation. Built on the Qwen2.5-7B dense backbone, the model handles ten cross-modal input types and generates text, images, and speech. Evaluation across 85 benchmarks shows state-of-the-art or highly competitive performance: the model surpasses Qwen2.5-Omni on more than 50 of 76 comparable benchmarks, with roughly a +7% average gain in video understanding, +4% in audio-visual reasoning, a 4.2% reduction in WER for long-form speech recognition, and markedly improved controllability in image generation.
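As a rough illustration of the dynamic-capacity routing described above, the sketch below builds an MoE layer with always-active shared experts, a top-k router over routed experts, and parameter-free null experts that let tokens skip the extra FFN compute. Module names, sizes, and the routing rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal PyTorch sketch of a dynamic-capacity MoE layer with shared, routed,
# and null experts. All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)


class DynamicCapacityMoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_shared=1, n_routed=6, n_null=2, top_k=2):
        super().__init__()
        self.shared = nn.ModuleList([Expert(d_model, d_ff) for _ in range(n_shared)])
        self.routed = nn.ModuleList([Expert(d_model, d_ff) for _ in range(n_routed)])
        self.n_routed, self.top_k = n_routed, top_k
        # The router scores routed experts plus null experts (null = no extra FFN).
        self.router = nn.Linear(d_model, n_routed + n_null, bias=False)

    def forward(self, x):  # x: (batch, seq, d_model)
        out = sum(e(x) for e in self.shared)           # shared experts: always active
        probs = F.softmax(self.router(x), dim=-1)      # (B, S, n_routed + n_null)
        topk_p, topk_idx = probs.topk(self.top_k, dim=-1)
        for slot in range(self.top_k):
            idx, p = topk_idx[..., slot], topk_p[..., slot]
            for e_id, expert in enumerate(self.routed):
                mask = idx == e_id                     # tokens sent to this routed expert
                if mask.any():
                    out[mask] = out[mask] + p[mask].unsqueeze(-1) * expert(x[mask])
            # Selections with idx >= n_routed hit a null expert: no FFN is applied,
            # so per-token compute shrinks whenever the router prefers null experts.
        return out
```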

📝 Abstract
We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generation. Based on the Qwen2.5-7B dense architecture, we build Uni-MoE-2.0-Omni from scratch through three core contributions: a dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. The model is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on more than 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. over 8 benchmarks), omnimodal understanding (+7% avg. over 4 benchmarks), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.
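To make the Omni-Modality 3D RoPE idea concrete, here is a minimal sketch in which the rotary dimension is split into temporal, height, and width groups, each rotated by its own coordinate. The split ratio, frequency base, and how each modality fills the (t, h, w) grid are assumptions; the abstract only states that the encoding aligns spatio-temporal structure across modalities in the self-attention layer.

```python
# Minimal sketch of a 3D rotary position embedding: the head dimension is split
# into three groups rotated by temporal, height, and width coordinates.
# Split sizes and the frequency base are illustrative assumptions.
import torch


def rope_angles(pos, dim, base=10000.0):
    # pos: (seq,) integer positions; returns (seq, dim/2) rotation angles.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return pos.float()[:, None] * inv_freq[None, :]


def apply_3d_rope(q, pos_thw, dims=(32, 16, 16)):
    # q: (seq, head_dim) with head_dim = sum(dims); pos_thw: (seq, 3) = (t, h, w).
    outs, start = [], 0
    for axis, d in enumerate(dims):
        x = q[:, start:start + d]
        ang = rope_angles(pos_thw[:, axis], d)           # (seq, d/2)
        cos, sin = ang.cos(), ang.sin()
        x1, x2 = x[:, : d // 2], x[:, d // 2:]
        outs.append(torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1))
        start += d
    return torch.cat(outs, dim=-1)
```

Under such a scheme, a text token would typically carry the same index on all three axes, image patches would share one temporal index while varying over height and width, and audio or video frames would advance along the temporal axis; this is an assumption consistent with the abstract's description, not a confirmed detail of the model.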
Problem

Research questions and friction points this paper is trying to address.

Developing a scalable omnimodal AI model for multimodal understanding and generation
Balancing computational efficiency with cross-modal capability through MoE architecture
Achieving competitive performance across diverse multimodal benchmarks and tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic-capacity MoE design balances efficiency and capability
Progressive training strategy enhanced with iterative reinforcement learning (GSPO followed by DPO; see the sketch after this list)
Carefully curated multimodal data matching technique for alignment
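The DPO half of the iterative GSPO-DPO recipe can be summarised by the standard DPO objective below; the GSPO stage, the reward signal, and how preference pairs are collected are not described in this summary, so only the loss on precomputed sequence log-probabilities is sketched, with illustrative names.

```python
# Minimal sketch of the DPO step in an iterative GSPO-DPO schedule.
# Inputs are sequence-level log-probabilities of chosen/rejected responses
# under the current policy and a frozen reference model.
import torch.nn.functional as F


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    chosen_margin = logp_chosen - ref_logp_chosen        # policy gain on preferred response
    rejected_margin = logp_rejected - ref_logp_rejected  # policy gain on dispreferred response
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```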
Authors
Yunxin Li
Research Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen
Xinyu Chen
Research Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen
Shenyuan Jiang
Research Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen
Haoyuan Shi
Research Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen
Zhenyu Liu
Research Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen
Xuanyu Zhang
Research Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen
Nanhao Deng
Research Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen
Zhenran Xu
Harbin Institute of Technology (Shenzhen)
Yicheng Ma
Research Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen
Meishan Zhang
Associate Professor, Harbin Institute of Technology at Shenzhen
Baotian Hu
Harbin Institute of Technology (Shenzhen)
Min Zhang
Research Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen