🤖 AI Summary
This work introduces Ming-Omni, an open-source unified multimodal foundation model that addresses key limitations of existing approaches: reliance on multiple specialized models, task-specific fine-tuning, or architectural reconfiguration. Methodologically, it builds on the Ling Mixture-of-Experts (MoE) backbone, extending it with newly proposed modality-specific routers and integrating dedicated modality encoders, the Ming-Lite-Uni image generator, and an end-to-end audio decoder, enabling joint perception and generation across text, images, audio, and video within a single model. Ming-Omni supports cross-modal understanding, contextual multimodal dialogue, high-fidelity image editing, and text-to-speech synthesis, and is, to the authors' knowledge, the first open-source model to match GPT-4o in modality support. All code and model weights are publicly released, providing foundational infrastructure for research on unified multimodal foundation models.
📝 Abstract
We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which together also allow the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results demonstrate that Ming-Omni offers a powerful solution for unified perception and generation across all modalities. Notably, Ming-Omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.
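To make the modality-specific routing idea concrete, the following is a minimal toy sketch (not the paper's implementation) of an MoE layer where tokens from every modality share one pool of experts, but each modality consults its own gating network. All shapes, names, and the top-k scheme here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

D, E, K = 8, 4, 2  # hidden size, number of experts, experts chosen per token

# Shared expert FFNs, reduced to a single weight matrix each for brevity.
experts = [rng.normal(size=(D, D)) * 0.1 for _ in range(E)]

# Hypothetical modality-specific routers: one gating matrix per modality,
# so routing statistics can specialize per input type.
routers = {m: rng.normal(size=(D, E)) * 0.1
           for m in ("text", "image", "audio", "video")}

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(tokens, modality):
    """Route each token through its top-K experts via the modality's router."""
    probs = softmax(tokens @ routers[modality])   # (N, E) gating probabilities
    topk = np.argsort(-probs, axis=-1)[:, :K]     # indices of the K best experts
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        weights = probs[i, topk[i]]
        weights = weights / weights.sum()         # renormalize over chosen experts
        for w, e in zip(weights, topk[i]):
            out[i] += w * (tok @ experts[e])
    return out

# Tokens from different modality encoders flow through the same backbone,
# but each modality uses its own router.
y_text = moe_forward(rng.normal(size=(3, D)), "text")
y_image = moe_forward(rng.normal(size=(5, D)), "image")
```

The key design point this sketch illustrates is that the experts are shared across modalities (encouraging cross-modal fusion) while only the lightweight gating functions are modality-specific.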