Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction

📅 2025-05-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the modality gap between vision and language in natural multimodal interaction by proposing an open-source unified multimodal framework that jointly models text-to-image generation and instruction-driven image editing. Methodologically, it introduces a multi-scale learnable token mechanism and a multi-scale cross-modal representation alignment strategy, integrating a frozen multimodal LLM (MLLM) with a trainable diffusion model. Building on MetaQueries and the M2-omni architecture, it constructs a unified visual generator. Unlike conventional understanding-only paradigms, the framework breaks the generation–understanding dichotomy, reporting improvements in both performance and interaction fluency across diverse multimodal interaction tasks. All code and model weights are publicly released to facilitate reproducibility and support AGI research.
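For readers who want the described wiring at a glance, below is a minimal PyTorch sketch of that design: a frozen MLLM produces hidden states, learnable multi-scale query tokens read from them via cross-attention, and a trainable diffusion decoder conditions on the result. All module names, dimensions, and interfaces here (`MultiScaleQueries`, `UnifiedGenerator`, the decoder signature) are illustrative assumptions, not the released Ming-Lite-Uni code.

```python
# Conceptual sketch only: frozen MLLM + learnable multi-scale query tokens
# + trainable diffusion decoder. Names, sizes, and interfaces are hypothetical.
import torch
import torch.nn as nn

class MultiScaleQueries(nn.Module):
    def __init__(self, hidden_dim=2048, scales=(4, 16, 64)):
        super().__init__()
        # One set of learnable query tokens per spatial scale.
        self.queries = nn.ParameterList(
            [nn.Parameter(torch.randn(n, hidden_dim) * 0.02) for n in scales]
        )
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)

    def forward(self, llm_hidden):  # llm_hidden: (B, T, hidden_dim)
        outputs = []
        for q in self.queries:
            q_batched = q.unsqueeze(0).expand(llm_hidden.size(0), -1, -1)
            # Queries attend to the frozen MLLM's hidden states.
            pooled, _ = self.attn(q_batched, llm_hidden, llm_hidden)
            outputs.append(pooled)  # (B, n_scale, hidden_dim)
        return outputs

class UnifiedGenerator(nn.Module):
    def __init__(self, frozen_mllm, diffusion_decoder, hidden_dim=2048):
        super().__init__()
        self.mllm = frozen_mllm.eval()       # kept fixed
        for p in self.mllm.parameters():
            p.requires_grad_(False)
        self.queries = MultiScaleQueries(hidden_dim)
        self.decoder = diffusion_decoder     # trainable; interface assumed below

    def forward(self, input_ids, noisy_latents, timesteps):
        with torch.no_grad():
            hidden = self.mllm(input_ids)    # (B, T, hidden_dim), assumed output
        cond = torch.cat(self.queries(hidden), dim=1)
        # Assumed decoder signature: (latents, timesteps, conditioning tokens).
        return self.decoder(noisy_latents, timesteps, cond)
```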

📝 Abstract
We introduce Ming-Lite-Uni, an open-source multimodal framework featuring a newly designed unified visual generator and a native multimodal autoregressive model tailored for unifying vision and language. Specifically, this project provides an open-source implementation of the integrated MetaQueries and M2-omni framework, while introducing novel multi-scale learnable tokens and a multi-scale representation alignment strategy. By leveraging a fixed MLLM and a learnable diffusion model, Ming-Lite-Uni enables native multimodal AR models to perform both text-to-image generation and instruction-based image editing, expanding their capabilities beyond pure visual understanding. Our experimental results demonstrate the strong performance of Ming-Lite-Uni and illustrate the impressively fluid nature of its interactive process. All code and model weights are open-sourced to foster further exploration within the community. Notably, this work aligns with concurrent multimodal AI milestones, such as ChatGPT-4o's native image generation update on March 25, 2025, underscoring the broader significance of unified models like Ming-Lite-Uni on the path toward AGI. Ming-Lite-Uni is in its alpha stage and will soon be further refined.
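As one concrete reading of the multi-scale representation alignment strategy mentioned above, the hedged sketch below computes a cosine-distance loss between pooled generator features and MLLM-derived semantic features at each scale. The pooling, normalization, and per-scale averaging are assumptions made for illustration; the paper's actual objective may differ.

```python
# Hedged sketch of a multi-scale alignment loss; not the paper's exact formulation.
import torch.nn.functional as F

def alignment_loss(gen_feats, sem_feats):
    """gen_feats, sem_feats: lists of (B, N_s, D) tensors, one entry per scale,
    assumed to share the same feature dimension D."""
    loss = 0.0
    for g, s in zip(gen_feats, sem_feats):
        g = F.normalize(g.mean(dim=1), dim=-1)  # pool tokens, unit-normalize
        s = F.normalize(s.mean(dim=1), dim=-1)
        loss = loss + (1.0 - (g * s).sum(dim=-1)).mean()  # cosine distance
    return loss / len(gen_feats)
```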
Problem

Research questions and friction points this paper is trying to address.

Unifying vision and language with a multimodal framework
Enabling text-to-image generation and image editing tasks
Advancing open-source unified models toward AGI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified visual generator and autoregressive model
Multi-scale learnable tokens and alignment strategy
Fixed MLLM and learnable diffusion model integration
Authors
Biao Gong (Ant Group | Alibaba Group): Generative Model, Retrieval, 3D Vision
Cheng Zou
Dandan Zheng
Hu Yu
Jingdong Chen
Jianxin Sun
Junbo Zhao
Jun Zhou
Kaixiang Ji (Ant Group): Computer Vision, Multimodal
Lixiang Ru (Ant Group): Computer Vision, MLLM, Multi-modal Learning, Remote Sensing
Libin Wang
Qingpei Guo (Ant Group): Multimodal LLMs, Vision-Language Models
Rui Liu
Weilong Chai
Xinyu Xiao
Ziyuan Huang