Ming-Omni: A Unified Multimodal Model for Perception and Generation

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work introduces Ming-Omni, reportedly the first open-source unified multimodal model to match GPT-4o in modality support, addressing key limitations of existing approaches: reliance on multiple specialized models, task-specific fine-tuning, or architectural reconfiguration. Methodologically, it builds on the Ling MoE backbone equipped with newly proposed modality-specific routers, integrating dedicated modality encoders, the Ming-Lite-Uni image generator, and an end-to-end audio decoder to enable joint perception and generation across text, images, audio, and video. The model supports cross-modal understanding, context-aware multimodal dialogue, text-to-speech (TTS) synthesis, and high-fidelity image generation and editing. All code and model weights are publicly released, providing a shared foundation for research on unified multimodal foundation models.
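To make the routing idea concrete, here is a minimal sketch of a modality-specific MoE router in PyTorch: each modality gets its own routing network over a shared pool of experts, so routing for text, image, audio, and video tokens is learned separately rather than entangled in one gate. The class names, dimensions, and top-k dispatch loop below are illustrative assumptions, not the paper's actual Ling implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalitySpecificMoE(nn.Module):
    """Sketch of an MoE layer with one router per modality (illustrative only)."""

    def __init__(self, d_model=512, n_experts=8, top_k=2,
                 modalities=("text", "image", "audio", "video")):
        super().__init__()
        # Shared expert pool: simple feed-forward blocks stand in for real experts.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        # One routing network per modality, so each modality learns its own
        # expert-selection statistics.
        self.routers = nn.ModuleDict({m: nn.Linear(d_model, n_experts)
                                      for m in modalities})
        self.top_k = top_k

    def forward(self, tokens: torch.Tensor, modality: str) -> torch.Tensor:
        # tokens: (batch, seq, d_model), produced by that modality's encoder.
        logits = self.routers[modality](tokens)           # (B, S, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)    # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(tokens)
        # Naive dispatch loop, kept explicit for readability rather than speed.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(tokens[mask])
        return out
```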

📝 Abstract
We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allow the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results show that Ming-Omni offers a powerful solution for unified perception and generation across all modalities. Notably, our proposed Ming-Omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.
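The abstract's overall dataflow, dedicated per-modality encoders feeding one shared backbone whose fused representation drives both an audio decoder and an image generator, can be sketched at interface level as follows. The module names, stand-in encoders, and output shapes are placeholder assumptions for illustration; the real system uses pretrained encoder towers, the Ling MoE backbone, and the Ming-Lite-Uni generator.

```python
import torch
import torch.nn as nn

class UnifiedOmniModel(nn.Module):
    """Interface-level sketch of a unified perception/generation pipeline (illustrative only)."""

    def __init__(self, d_model=512):
        super().__init__()
        # Dedicated encoders: in the real system these would be pretrained
        # vision/audio/video towers; here they are stand-in projections.
        self.encoders = nn.ModuleDict({
            "text":  nn.Embedding(32000, d_model),
            "image": nn.Linear(768, d_model),
            "audio": nn.Linear(128, d_model),
            "video": nn.Linear(768, d_model),
        })
        # Shared backbone: a plain Transformer stands in for the Ling MoE
        # with modality-specific routers sketched earlier.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        # Generation heads: stand-ins for the audio decoder and Ming-Lite-Uni.
        self.audio_head = nn.Linear(d_model, 128)   # e.g. mel-spectrogram frames
        self.image_head = nn.Linear(d_model, 768)   # e.g. latent image tokens

    def forward(self, inputs: dict) -> dict:
        # Encode each present modality into token sequences and concatenate them
        # into one shared context before fusion.
        token_seqs = [self.encoders[m](x) for m, x in inputs.items()]
        fused = self.backbone(torch.cat(token_seqs, dim=1))
        # The same fused representation feeds both generation heads.
        return {"speech": self.audio_head(fused), "image": self.image_head(fused)}
```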
Problem

Research questions and friction points this paper is trying to address.

Processing images, text, audio, and video within a single unified model
Fusing diverse multimodal inputs efficiently without separate models, task-specific fine-tuning, or structural redesign
Lack of an open-source model that matches GPT-4o in modality support
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multimodal model for perception and generation
MoE architecture with modality-specific routers
Advanced audio decoder and Ming-Lite-Uni image generator for speech and image generation (see the usage sketch after this list)
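Continuing the illustrative sketch above, a single hypothetical model instance can serve both perception and generation requests, which is the point of the unified design; the inputs and shapes here are made up for demonstration and are not the paper's evaluation setup.

```python
# Hypothetical usage of the UnifiedOmniModel sketch above (assumes the earlier imports).
model = UnifiedOmniModel()
text_ids = torch.randint(0, 32000, (1, 16))      # a tokenized prompt
image_feats = torch.randn(1, 64, 768)            # patch features from a vision encoder
outputs = model({"text": text_ids, "image": image_feats})
speech_frames = outputs["speech"]                # would feed the audio decoder (TTS)
image_latents = outputs["image"]                 # would feed the image generator (editing)
```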
Authors
Biao Gong (Ant Group | Alibaba Group), Cheng Zou, Chuanyang Zheng, Chunluan Zhou (Nanyang Technological University), Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu (Ant Group), GuangMing Yao, Jun Zhou, Jingdong Chen, Jianxin Sun, Jiajia Liu (Ant Group), Jianjiang Zhu, Jun Peng, Kaixiang Ji (Ant Group), Kaiyou Song (Huazhong University of Science and Technology), Kaimeng Ren, Libin Wang, Lixiang Ru (Ant Group), Lele Xie (South China University of Technology), Longhua Tan, Lyuxin Xue, Lan Wang, Mochen Bai, Ning Gao, Pei Chen, Qingpei Guo (Ant Group), Qinglong Zhang, Qiang Xu, Rui Liu, Ruijie Xiong, Sirui Gao, Tinghao Liu, Taisong Li, Weilong Chai, Xinyu Xiao, Xiaomei Wang, Xiaoxue Chen, Xiao Lu, Xiaoyu Li, Xingning Dong, Xuzheng Yu, Yi Yuan (NetEase Fuxi AI Lab), Yuting Gao, Yunxiao Sun, Yipeng Chen, Yifei Wu, Yongjie Lyu, Ziping Ma, Zipeng Feng, Zhijiang Fang, Zhihao Qiu, Ziyuan Huang, Zhengyu He