Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction

📅 2025-05-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the modality gap between vision and language in natural multimodal interaction by proposing an open-source unified multimodal framework that jointly models text-to-image generation and instruction-driven image editing. Methodologically, it introduces a multi-scale learnable token mechanism and a multi-scale cross-modal representation alignment strategy, integrating a frozen multimodal LLM (MLLM) with a trainable diffusion model. Building on MetaQueries and the M2-omni architecture, it constructs a unified visual generator. Unlike conventional understanding-only paradigms, the framework breaks the generation–understanding dichotomy, reporting improvements in both performance and interaction fluency across diverse multimodal interaction tasks. All code and model weights are publicly released to facilitate reproducibility and support AGI research.
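For readers who want the described wiring at a glance, below is a minimal PyTorch sketch of that design: a frozen MLLM produces hidden states, learnable multi-scale query tokens read from them via cross-attention, and a trainable diffusion decoder conditions on the result. All module names, dimensions, and interfaces here (`MultiScaleQueries`, `UnifiedGenerator`, the decoder signature) are illustrative assumptions, not the released Ming-Lite-Uni code.

```python
# Conceptual sketch only: frozen MLLM + learnable multi-scale query tokens
# + trainable diffusion decoder. Names, sizes, and interfaces are hypothetical.
import torch
import torch.nn as nn

class MultiScaleQueries(nn.Module):
    def __init__(self, hidden_dim=2048, scales=(4, 16, 64)):
        super().__init__()
        # One set of learnable query tokens per spatial scale.
        self.queries = nn.ParameterList(
            [nn.Parameter(torch.randn(n, hidden_dim) * 0.02) for n in scales]
        )
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)

    def forward(self, llm_hidden):  # llm_hidden: (B, T, hidden_dim)
        outputs = []
        for q in self.queries:
            q_batched = q.unsqueeze(0).expand(llm_hidden.size(0), -1, -1)
            # Queries attend to the frozen MLLM's hidden states.
            pooled, _ = self.attn(q_batched, llm_hidden, llm_hidden)
            outputs.append(pooled)  # (B, n_scale, hidden_dim)
        return outputs

class UnifiedGenerator(nn.Module):
    def __init__(self, frozen_mllm, diffusion_decoder, hidden_dim=2048):
        super().__init__()
        self.mllm = frozen_mllm.eval()       # kept fixed
        for p in self.mllm.parameters():
            p.requires_grad_(False)
        self.queries = MultiScaleQueries(hidden_dim)
        self.decoder = diffusion_decoder     # trainable; interface assumed below

    def forward(self, input_ids, noisy_latents, timesteps):
        with torch.no_grad():
            hidden = self.mllm(input_ids)    # (B, T, hidden_dim), assumed output
        cond = torch.cat(self.queries(hidden), dim=1)
        # Assumed decoder signature: (latents, timesteps, conditioning tokens).
        return self.decoder(noisy_latents, timesteps, cond)
```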

📝 Abstract
We introduce Ming-Lite-Uni, an open-source multimodal framework featuring a newly designed unified visual generator and a native multimodal autoregressive model tailored for unifying vision and language. Specifically, this project provides an open-source implementation of the integrated MetaQueries and M2-omni framework, while introducing novel multi-scale learnable tokens and a multi-scale representation alignment strategy. By leveraging a fixed MLLM and a learnable diffusion model, Ming-Lite-Uni enables native multimodal AR models to perform both text-to-image generation and instruction-based image editing, expanding their capabilities beyond pure visual understanding. Our experimental results demonstrate the strong performance of Ming-Lite-Uni and illustrate the impressively fluid nature of its interactive process. All code and model weights are open-sourced to foster further exploration within the community. Notably, this work aligns with concurrent multimodal AI milestones, such as ChatGPT-4o's native image generation update on March 25, 2025, underscoring the broader significance of unified models like Ming-Lite-Uni on the path toward AGI. Ming-Lite-Uni is in its alpha stage and will soon be further refined.
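As one concrete reading of the multi-scale representation alignment strategy mentioned above, the hedged sketch below computes a cosine-distance loss between pooled generator features and MLLM-derived semantic features at each scale. The pooling, normalization, and per-scale averaging are assumptions made for illustration; the paper's actual objective may differ.

```python
# Hedged sketch of a multi-scale alignment loss; not the paper's exact formulation.
import torch.nn.functional as F

def alignment_loss(gen_feats, sem_feats):
    """gen_feats, sem_feats: lists of (B, N_s, D) tensors, one entry per scale,
    assumed to share the same feature dimension D."""
    loss = 0.0
    for g, s in zip(gen_feats, sem_feats):
        g = F.normalize(g.mean(dim=1), dim=-1)  # pool tokens, unit-normalize
        s = F.normalize(s.mean(dim=1), dim=-1)
        loss = loss + (1.0 - (g * s).sum(dim=-1)).mean()  # cosine distance
    return loss / len(gen_feats)
```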
Problem

Research questions and friction points this paper is trying to address.

Unifying vision and language with a multimodal framework
Enabling text-to-image generation and image editing tasks
Advancing open-source unified models toward AGI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified visual generator and autoregressive model
Multi-scale learnable tokens and alignment strategy
Fixed MLLM and learnable diffusion model integration
Authors
Biao Gong (Ant Group | Alibaba Group): Generative Model, Retrieval, 3D Vision
Cheng Zou
Dandan Zheng
Hu Yu
Jingdong Chen
Jianxin Sun
Junbo Zhao
Jun Zhou
Kaixiang Ji (Ant Group): Computer Vision, Multimodal
Lixiang Ru (Ant Group): Computer Vision, MLLM, Multi-modal Learning, Remote Sensing
Libin Wang
Qingpei Guo (Ant Group): Multimodal LLMs, Vision-Language Models
Rui Liu
Weilong Chai
Xinyu Xiao
Ziyuan Huang