Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation

📅 2025-10-28
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of efficiently scaling cross-modal perception and generation across vision, speech, and language for Artificial General Intelligence (AGI). We propose a sparse unified multimodal architecture integrating a sparse Mixture-of-Experts (MoE) backbone, context-aware automatic speech recognition (ASR), high-resolution controllable image generation, and generative segmentation to enable joint training and inference over all three modalities. Our key contributions are: (i) the first sparse architecture simultaneously supporting high-fidelity text rendering, cross-modal consistent editing, and dialect-robust ASR; and (ii) significantly improved spatial consistency in image editing via generative segmentation. Experiments demonstrate state-of-the-art performance on 12 contextual ASR benchmarks, as well as new records on text-to-image generation and generative segmentation, all while maintaining computational efficiency and scalable model capacity.

📝 Abstract
We propose Ming-Flash-Omni, an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables highly efficient scaling (dramatically improving computational efficiency while significantly expanding model capacity) and empowers stronger unified multimodal intelligence across vision, speech, and language, representing a key step toward Artificial General Intelligence (AGI). Compared to its predecessor, the upgraded version exhibits substantial improvements across multimodal understanding and generation. We significantly advance speech recognition capabilities, achieving state-of-the-art performance in contextual ASR and highly competitive results in dialect-aware ASR. In image generation, Ming-Flash-Omni introduces high-fidelity text rendering and demonstrates marked gains in scene consistency and identity preservation during image editing. Furthermore, Ming-Flash-Omni introduces generative segmentation, a capability that not only achieves strong standalone segmentation performance but also enhances spatial control in image generation and improves editing consistency. Notably, Ming-Flash-Omni achieves state-of-the-art results in text-to-image generation and generative segmentation, and sets new records on all 12 contextual ASR benchmarks, all within a single unified architecture.
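The sparsity the abstract describes (100 billion total parameters, only 6.1 billion active per token) comes from token-level expert routing: each token is dispatched to a small top-k subset of expert feed-forward networks, so per-token compute scales with k rather than with the total expert count. The paper's implementation is not shown on this page; the sketch below is a generic top-k MoE layer in PyTorch, with all names and sizes (d_model, n_experts, top_k) chosen for illustration rather than taken from Ming-Flash-Omni.

```python
# Minimal sketch of token-level top-k MoE routing (generic illustration,
# not the Ming-Flash-Omni implementation; all sizes are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=32, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                           # x: (n_tokens, d_model)
        logits = self.router(x)                     # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, -1)  # each token picks top_k experts
        weights = F.softmax(weights, dim=-1)        # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e            # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(SparseMoE()(tokens).shape)  # torch.Size([16, 512]); only 2 of 32 experts run per token
```

With top_k = 2 of 32 experts, roughly 6% of the expert parameters run per token, the same order of sparsity as the reported 6.1B active out of 100B total; production MoE systems typically replace the Python loop with batched expert dispatch and train the router with an auxiliary load-balancing loss.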
Problem

Research questions and friction points this paper is trying to address.

Scaling multimodal AI efficiently with a sparse Mixture-of-Experts architecture
Advancing unified perception and generation across vision, speech, and language
Achieving state-of-the-art performance in contextual ASR and text-to-image generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Mixture-of-Experts architecture with 100B total parameters (6.1B active per token)
Unified multimodal intelligence across vision, speech, and language
Generative segmentation enhances spatial control in image generation
Bowen Ma
Senior Research Associate, The University of Hong Kong; PhD, UT-Austin; BSc, USTC
Condensed Matter Theory
Cheng Zou
Inclusion AI, Ant Group
Canxiang Yan
Inclusion AI, Ant Group
Chunxiang Jin
Inclusion AI, Ant Group
Chunjie Shen
Inclusion AI, Ant Group
Dandan Zheng
Inclusion AI, Ant Group
Fudong Wang
Unknown affiliation
computer vision, 3D scene/human modeling, optimization
Furong Xu
Ant Group
Computer Vision, Deep Learning, Image/Video Retrieval, Representation Learning
GuangMing Yao
Inclusion AI, Ant Group
Jun Zhou
Inclusion AI, Ant Group
Jingdong Chen
Inclusion AI, Ant Group
Jianing Li
Inclusion AI, Ant Group
Jianxin Sun
Inclusion AI, Ant Group
Jiajia Liu
Ant Group
computer vision, multimodal
Jianjiang Zhu
Inclusion AI, Ant Group
Jianping Jiang
Peking University
Mixed Reality, Multimodal Learning
Jun Peng
PhD, Soochow University, Australian National University
Photovoltaics
Kaixiang Ji
Ant Group
Computer Vision, Multimodal
Kaimeng Ren
Inclusion AI, Ant Group
Libin Wang
Inclusion AI, Ant Group
Lixiang Ru
Ant Group
computer vision, MLLM, multi-modal learning, remote sensing
Longhua Tan
Inclusion AI, Ant Group
Lan Wang
Inclusion AI, Ant Group