🤖 AI Summary
To address the challenge of unifying multimodal understanding and generation within a single framework, this paper introduces OpenUni—the first fully open-source, minimalist unified architecture. Its core design modularly couples off-the-shelf multimodal large language models (MLLMs) and diffusion models via learnable queries and a lightweight Transformer connector, activating only 1.1B or 3.1B parameters. Crucially, OpenUni avoids end-to-end training, drastically reducing computational overhead while enabling strong cross-modal synergy. Experiments demonstrate state-of-the-art performance on instruction-aligned image generation tasks and top-tier results across multiple benchmarks—including GenEval, DPG-Bench, and WISE. To foster reproducibility and community advancement, the project releases all model weights, training code, and a high-quality dataset comprising 23 million image-text pairs.
📝 Abstract
In this report, we present OpenUni, a simple, lightweight, and fully open-source baseline for unifying multimodal understanding and generation. Inspired by prevailing practices in unified model learning, we adopt an efficient training strategy that minimizes training complexity and overhead by bridging off-the-shelf multimodal large language models (MLLMs) and diffusion models through a set of learnable queries and a lightweight Transformer-based connector. With a minimalist choice of architecture, we demonstrate that OpenUni can: 1) generate high-quality and instruction-aligned images, and 2) achieve exceptional performance on standard benchmarks such as GenEval, DPG-Bench, and WISE, with only 1.1B and 3.1B activated parameters. To support open research and community advancement, we release all model weights, training code, and our curated training datasets (including 23M image-text pairs) at https://github.com/wusize/OpenUni.
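The bridging idea described above can be sketched in PyTorch. The snippet below is a minimal illustration, not the released implementation: the class name `QueryConnector`, the dimensions, and the use of `nn.TransformerDecoder` (learnable queries cross-attending to frozen MLLM hidden states, then projected to the diffusion model's conditioning space) are all assumptions for the sake of the example.

```python
import torch
import torch.nn as nn


class QueryConnector(nn.Module):
    """Hypothetical sketch of the OpenUni-style connector.

    A fixed set of learnable queries attends to the hidden states of a
    frozen, off-the-shelf MLLM via a small Transformer decoder; the
    resulting tokens are projected to the conditioning dimension expected
    by an off-the-shelf diffusion model. Only this module (plus the
    diffusion side) would be trained; the MLLM stays frozen.
    """

    def __init__(self, num_queries=256, mllm_dim=2048, cond_dim=1024,
                 depth=4, heads=8):
        super().__init__()
        # Learnable query embeddings (small init, as is common practice).
        self.queries = nn.Parameter(torch.randn(num_queries, mllm_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(
            d_model=mllm_dim, nhead=heads, batch_first=True
        )
        self.connector = nn.TransformerDecoder(layer, num_layers=depth)
        # Map connector outputs into the diffusion model's condition space.
        self.proj = nn.Linear(mllm_dim, cond_dim)

    def forward(self, mllm_hidden):
        # mllm_hidden: (batch, seq_len, mllm_dim) hidden states from the MLLM.
        batch = mllm_hidden.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        out = self.connector(tgt=q, memory=mllm_hidden)
        # Returns (batch, num_queries, cond_dim) conditioning tokens.
        return self.proj(out)
```

In this design, the number of trainable parameters is dominated by the connector and the diffusion model, which is consistent with the report's point that only 1.1B or 3.1B parameters are activated rather than the full MLLM-plus-diffusion stack being trained end to end.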