🤖 AI Summary
This paper addresses the modality gap between vision and language in natural multimodal interaction by proposing an open-source unified multimodal framework that jointly models text-to-image generation and instruction-driven image editing. Methodologically, it introduces a novel multi-scale learnable token mechanism and a multi-scale representation alignment strategy, integrating a frozen multimodal large language model (MLLM) with a trainable diffusion model. Building on MetaQueries and the M2-omni architecture, it constructs a unified visual generator. Unlike conventional understanding-only paradigms, the framework bridges the generation–understanding divide, achieving notable improvements in performance and interaction fluency across diverse multimodal tasks. All code and model weights are publicly released to support reproducibility and AGI research.
📝 Abstract
We introduce Ming-Lite-Uni, an open-source multimodal framework featuring a newly designed unified visual generator and a native multimodal autoregressive model tailored to unifying vision and language. Specifically, this project provides an open-source implementation of the integrated MetaQueries and M2-omni framework, while introducing novel multi-scale learnable tokens and a multi-scale representation alignment strategy. By leveraging a frozen MLLM and a learnable diffusion model, Ming-Lite-Uni enables native multimodal AR models to perform both text-to-image generation and instruction-based image editing, expanding their capabilities beyond pure visual understanding. Our experimental results demonstrate the strong performance of Ming-Lite-Uni and the impressively fluid nature of its interactive process. All code and model weights are open-sourced to foster further exploration within the community. Notably, this work aligns with concurrent multimodal AI milestones, such as the native image generation added to ChatGPT-4o on March 25, 2025, underscoring the broader significance of unified models like Ming-Lite-Uni on the path toward AGI. Ming-Lite-Uni is currently in an alpha stage and will be further refined.
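The abstract describes an interplay between multi-scale learnable tokens, a frozen MLLM, and a trainable diffusion model. A rough sketch of how such conditioning could be wired is shown below; every name, dimension, and the bare dot-product cross-attention are illustrative assumptions, not the actual Ming-Lite-Uni implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hidden size (illustrative, not the real model's)

# Stand-in for frozen MLLM hidden states over a text prompt (random here).
prompt_hidden = rng.normal(size=(12, d_model))  # 12 prompt tokens

# Hypothetical multi-scale learnable tokens: one query set per target
# resolution. In a scheme like this, these queries (and the diffusion
# decoder) would be the trainable parts while the MLLM stays frozen.
scales = {"4x4": 16, "8x8": 64, "16x16": 256}
learnable_tokens = {k: rng.normal(size=(n, d_model)) for k, n in scales.items()}

def cross_attend(queries, keys_values):
    """Single-head dot-product cross-attention (no projections, for brevity)."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

# Each scale's tokens query the frozen hidden states; the resulting
# per-scale features would condition a diffusion decoder at that scale.
conditions = {k: cross_attend(q, prompt_hidden)
              for k, q in learnable_tokens.items()}
for name, cond in conditions.items():
    print(name, cond.shape)
```

The point of the sketch is only the shape flow: a small, fixed number of learnable queries per scale distills a variable-length frozen representation into fixed-size conditioning tensors, one per resolution.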