Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation

📅 2025-10-28
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of efficiently scaling cross-modal perception and generation across vision, speech, and language for Artificial General Intelligence (AGI). We propose a sparse unified multimodal architecture integrating a sparse Mixture-of-Experts (MoE) backbone, context-aware automatic speech recognition (ASR), high-resolution controllable image generation, and generative segmentation to enable joint training and inference over all three modalities. Our key contributions are: (i) the first sparse architecture simultaneously supporting high-fidelity text rendering, cross-modal consistent editing, and dialect-robust ASR; and (ii) significantly improved spatial consistency in image editing via generative segmentation. Experiments demonstrate state-of-the-art performance on 12 contextual ASR benchmarks, as well as new records on text-to-image generation and generative segmentation, all while maintaining computational efficiency and scalable model capacity.

📝 Abstract
We propose Ming-Flash-Omni, an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables highly efficient scaling (dramatically improving computational efficiency while significantly expanding model capacity) and empowers stronger unified multimodal intelligence across vision, speech, and language, representing a key step toward Artificial General Intelligence (AGI). Compared to its predecessor, the upgraded version exhibits substantial improvements across multimodal understanding and generation. We significantly advance speech recognition capabilities, achieving state-of-the-art performance in contextual ASR and highly competitive results in dialect-aware ASR. In image generation, Ming-Flash-Omni introduces high-fidelity text rendering and demonstrates marked gains in scene consistency and identity preservation during image editing. Furthermore, Ming-Flash-Omni introduces generative segmentation, a capability that not only achieves strong standalone segmentation performance but also enhances spatial control in image generation and improves editing consistency. Notably, Ming-Flash-Omni achieves state-of-the-art results in text-to-image generation and generative segmentation, and sets new records on all 12 contextual ASR benchmarks, all within a single unified architecture.
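The sparsity the abstract describes (100 billion total parameters, only 6.1 billion active per token) comes from token-level expert routing: each token is dispatched to a small top-k subset of expert feed-forward networks, so per-token compute scales with k rather than with the total expert count. The paper's implementation is not shown on this page; the sketch below is a generic top-k MoE layer in PyTorch, with all names and sizes (d_model, n_experts, top_k) chosen for illustration rather than taken from Ming-Flash-Omni.

```python
# Minimal sketch of token-level top-k MoE routing (generic illustration,
# not the Ming-Flash-Omni implementation; all sizes are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=32, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                           # x: (n_tokens, d_model)
        logits = self.router(x)                     # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, -1)  # each token picks top_k experts
        weights = F.softmax(weights, dim=-1)        # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e            # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(SparseMoE()(tokens).shape)  # torch.Size([16, 512]); only 2 of 32 experts run per token
```

With top_k = 2 of 32 experts, roughly 6% of the expert parameters run per token, the same order of sparsity as the reported 6.1B active out of 100B total; production MoE systems typically replace the Python loop with batched expert dispatch and train the router with an auxiliary load-balancing loss.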
Problem

Research questions and friction points this paper is trying to address.

Scaling multimodal AI efficiently with a sparse Mixture-of-Experts architecture
Advancing unified perception and generation across vision, speech, and language
Achieving state-of-the-art performance in contextual ASR and text-to-image generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Mixture-of-Experts architecture with 100B total parameters (6.1B active per token)
Unified multimodal intelligence across vision, speech, and language
Generative segmentation enhances spatial control in image generation
Bowen Ma
Senior Research Associate, The University of Hong Kong; PhD, UT-Austin; BSc, USTC
Condensed Matter Theory
Cheng Zou
Inclusion AI, Ant Group
Canxiang Yan
Inclusion AI, Ant Group
Chunxiang Jin
Inclusion AI, Ant Group
Chunjie Shen
Inclusion AI, Ant Group
Dandan Zheng
Inclusion AI, Ant Group
Fudong Wang
Unknown affiliation
computer vision, 3D scene/human modeling, optimization
Furong Xu
Ant Group
Computer Vision, Deep Learning, Image/Video Retrieval, Representation Learning
GuangMing Yao
Inclusion AI, Ant Group
Jun Zhou
Inclusion AI, Ant Group
Jingdong Chen
Inclusion AI, Ant Group
Jianing Li
Inclusion AI, Ant Group
Jianxin Sun
Inclusion AI, Ant Group
Jiajia Liu
Ant Group
computer vision, multimodal
Jianjiang Zhu
Inclusion AI, Ant Group
Jianping Jiang
Peking University
Mixed Reality, Multimodal Learning
Jun Peng
PhD, Soochow University, Australian National University
Photovoltaics
Kaixiang Ji
Ant Group
Computer Vision, Multimodal
Kaimeng Ren
Inclusion AI, Ant Group
Libin Wang
Inclusion AI, Ant Group
Lixiang Ru
Ant Group
computer vision, MLLM, multi-modal learning, remote sensing
Longhua Tan
Inclusion AI, Ant Group
Lan Wang
Inclusion AI, Ant Group