OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of unifying multimodal understanding and generation within a single framework, this paper introduces OpenUni, a fully open-source, minimalist unified architecture. Its core design modularly couples off-the-shelf multimodal large language models (MLLMs) and diffusion models via learnable queries and a lightweight transformer connector, activating only 1.1B or 3.1B parameters. By keeping the pretrained MLLM and diffusion model off the shelf rather than training the full stack end to end, OpenUni drastically reduces computational overhead while still enabling strong cross-modal synergy. Experiments demonstrate high-quality, instruction-aligned image generation and top-tier results across multiple benchmarks, including GenEval, DPG-Bench, and WISE. To foster reproducibility and community advancement, the project releases all model weights, training code, and a curated dataset of 23 million image-text pairs.

📝 Abstract
In this report, we present OpenUni, a simple, lightweight, and fully open-source baseline for unifying multimodal understanding and generation. Inspired by prevailing practices in unified model learning, we adopt an efficient training strategy that minimizes training complexity and overhead by bridging off-the-shelf multimodal large language models (MLLMs) and diffusion models through a set of learnable queries and a lightweight transformer-based connector. With this minimalist choice of architecture, we demonstrate that OpenUni can: 1) generate high-quality, instruction-aligned images, and 2) achieve exceptional performance on standard benchmarks such as GenEval, DPG-Bench, and WISE, with only 1.1B and 3.1B activated parameters. To support open research and community advancement, we release all model weights, training code, and our curated training datasets (including 23M image-text pairs) at https://github.com/wusize/OpenUni.
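The bridging idea in the abstract — a set of learnable queries cross-attending to frozen MLLM hidden states, with a small connector projecting the result into the diffusion model's conditioning space — can be sketched roughly as below. This is a minimal, hypothetical illustration in NumPy (single head, one layer, made-up dimensions), not the released OpenUni implementation, which uses a multi-layer transformer connector.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes, chosen only for illustration.
d_mllm, d_diff, n_queries, seq_len = 64, 32, 8, 20

# Stand-in for frozen MLLM hidden states for one text prompt.
mllm_hidden = rng.standard_normal((seq_len, d_mllm))

# Learnable queries: the trainable tokens that extract generation-relevant
# information from the (frozen) MLLM.
queries = rng.standard_normal((n_queries, d_mllm))

# One cross-attention layer standing in for the lightweight connector.
W_q = rng.standard_normal((d_mllm, d_mllm)) / np.sqrt(d_mllm)
W_k = rng.standard_normal((d_mllm, d_mllm)) / np.sqrt(d_mllm)
W_v = rng.standard_normal((d_mllm, d_mllm)) / np.sqrt(d_mllm)
W_out = rng.standard_normal((d_mllm, d_diff)) / np.sqrt(d_mllm)

def connector(queries, context):
    """Queries attend over MLLM states, then project to diffusion space."""
    q, k, v = queries @ W_q, context @ W_k, context @ W_v
    attn = softmax(q @ k.T / np.sqrt(d_mllm))
    return (queries + attn @ v) @ W_out

# Conditioning tokens that would be fed to the diffusion model.
cond = connector(queries, mllm_hidden)
print(cond.shape)  # one conditioning vector per learnable query
```

Only the queries and connector weights would be trained in this scheme, which is what keeps the activated parameter count (1.1B / 3.1B) and training cost low.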
Problem

Research questions and friction points this paper is trying to address.

Unifying multimodal understanding and generation tasks
Minimizing training complexity with lightweight architecture
Achieving high-quality image generation and benchmark performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight transformer-based connector for multimodal models
Efficient training strategy minimizing complexity
Open-source release of models and datasets
👥 Authors

Size Wu - Nanyang Technological University (Computer Vision)
Zhonghua Wu - SenseTime Research (Computer Vision, Deep Learning)
Zerui Gong - S-Lab, Nanyang Technological University
Qi Tao - SenseTime Research
Sheng Jin - SenseTime Research and Tetras.AI
Qinyue Li - SenseTime Research
Wei Li - S-Lab, Nanyang Technological University
Chen Change Loy - President's Chair Professor, MMLab@NTU, S-Lab, Nanyang Technological University (Computer Vision, Image Processing, Machine Learning)