OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of unifying multimodal understanding and generation within a single framework, this paper introduces OpenUni, a fully open-source, minimalist unified architecture. Its core design modularly couples off-the-shelf multimodal large language models (MLLMs) and diffusion models via learnable queries and a lightweight transformer connector, activating only 1.1B or 3.1B parameters. By keeping the pretrained MLLM and diffusion model off the shelf rather than training the full stack end to end, OpenUni drastically reduces computational overhead while still enabling strong cross-modal synergy. Experiments demonstrate high-quality, instruction-aligned image generation and top-tier results across multiple benchmarks, including GenEval, DPG-Bench, and WISE. To foster reproducibility and community advancement, the project releases all model weights, training code, and a curated dataset of 23 million image-text pairs.

📝 Abstract
In this report, we present OpenUni, a simple, lightweight, and fully open-source baseline for unifying multimodal understanding and generation. Inspired by prevailing practices in unified model learning, we adopt an efficient training strategy that minimizes training complexity and overhead by bridging off-the-shelf multimodal large language models (MLLMs) and diffusion models through a set of learnable queries and a lightweight transformer-based connector. With this minimalist choice of architecture, we demonstrate that OpenUni can: 1) generate high-quality, instruction-aligned images, and 2) achieve exceptional performance on standard benchmarks such as GenEval, DPG-Bench, and WISE, with only 1.1B and 3.1B activated parameters. To support open research and community advancement, we release all model weights, training code, and our curated training datasets (including 23M image-text pairs) at https://github.com/wusize/OpenUni.
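The bridging idea in the abstract — a set of learnable queries cross-attending to frozen MLLM hidden states, with a small connector projecting the result into the diffusion model's conditioning space — can be sketched roughly as below. This is a minimal, hypothetical illustration in NumPy (single head, one layer, made-up dimensions), not the released OpenUni implementation, which uses a multi-layer transformer connector.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes, chosen only for illustration.
d_mllm, d_diff, n_queries, seq_len = 64, 32, 8, 20

# Stand-in for frozen MLLM hidden states for one text prompt.
mllm_hidden = rng.standard_normal((seq_len, d_mllm))

# Learnable queries: the trainable tokens that extract generation-relevant
# information from the (frozen) MLLM.
queries = rng.standard_normal((n_queries, d_mllm))

# One cross-attention layer standing in for the lightweight connector.
W_q = rng.standard_normal((d_mllm, d_mllm)) / np.sqrt(d_mllm)
W_k = rng.standard_normal((d_mllm, d_mllm)) / np.sqrt(d_mllm)
W_v = rng.standard_normal((d_mllm, d_mllm)) / np.sqrt(d_mllm)
W_out = rng.standard_normal((d_mllm, d_diff)) / np.sqrt(d_mllm)

def connector(queries, context):
    """Queries attend over MLLM states, then project to diffusion space."""
    q, k, v = queries @ W_q, context @ W_k, context @ W_v
    attn = softmax(q @ k.T / np.sqrt(d_mllm))
    return (queries + attn @ v) @ W_out

# Conditioning tokens that would be fed to the diffusion model.
cond = connector(queries, mllm_hidden)
print(cond.shape)  # one conditioning vector per learnable query
```

Only the queries and connector weights would be trained in this scheme, which is what keeps the activated parameter count (1.1B / 3.1B) and training cost low.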
Problem

Research questions and friction points this paper is trying to address.

Unifying multimodal understanding and generation tasks
Minimizing training complexity with lightweight architecture
Achieving high-quality image generation and benchmark performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight transformer-based connector for multimodal models
Efficient training strategy minimizing complexity
Open-source release of models and datasets
👥 Authors

Size Wu - Nanyang Technological University (Computer Vision)
Zhonghua Wu - SenseTime Research (Computer Vision, Deep Learning)
Zerui Gong - S-Lab, Nanyang Technological University
Qi Tao - SenseTime Research
Sheng Jin - SenseTime Research and Tetras.AI
Qinyue Li - SenseTime Research
Wei Li - S-Lab, Nanyang Technological University
Chen Change Loy - President's Chair Professor, MMLab@NTU, S-Lab, Nanyang Technological University (Computer Vision, Image Processing, Machine Learning)