OmniGen2: Exploration to Advanced Multimodal Generation

📅 2025-06-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of a unified modeling framework for multimodal generation tasks. We propose a unified generative model with two decoding paths: independent text and image decoders with unshared parameters, plus a decoupled image tokenizer. This design natively supports text-to-image synthesis, image editing, and in-context generation while preserving the base model's text generation capability. A reflection mechanism for image generation enables iterative self-refinement, and the authors construct high-quality, task-specific multimodal datasets with a dedicated training pipeline. Experiments show competitive performance on text-to-image and image editing benchmarks, and state-of-the-art consistency among open-source models on OmniContext, the paper's new benchmark for in-context generation.

📝 Abstract
In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link: https://github.com/VectorSpaceLab/OmniGen2
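The decoupled design described in the abstract can be pictured with a minimal routing sketch. All class and function names below are hypothetical stand-ins, not OmniGen2's actual components: the point is only that text and image decoding use separate, unshared modules, and reference images enter through their own tokenizer so the underlying language model's text path is untouched.

```python
# Illustrative sketch of a dual-decoding-path design (hypothetical names,
# not the real OmniGen2 implementation): each modality owns its decoder,
# and images are encoded by a decoupled tokenizer rather than re-adapting
# the text model to VAE inputs.

class TextDecoder:
    """Stand-in for the original autoregressive text decoder (kept intact)."""
    def decode(self, hidden):
        return f"text<{hidden}>"

class ImageDecoder:
    """Stand-in for the separate image decoder with unshared parameters."""
    def decode(self, hidden):
        return f"image<{hidden}>"

class ImageTokenizer:
    """Decoupled image tokenizer for editing / in-context reference images."""
    def encode(self, image):
        return f"tok({image})"

class OmniModelSketch:
    def __init__(self):
        self.text_decoder = TextDecoder()
        self.image_decoder = ImageDecoder()
        self.image_tokenizer = ImageTokenizer()

    def generate(self, prompt, modality, ref_image=None):
        hidden = prompt
        if ref_image is not None:  # image editing or in-context generation
            hidden = f"{prompt}|{self.image_tokenizer.encode(ref_image)}"
        if modality == "text":
            return self.text_decoder.decode(hidden)
        return self.image_decoder.decode(hidden)

model = OmniModelSketch()
print(model.generate("a cat", "image"))                    # text-to-image
print(model.generate("make it blue", "image", "cat.png"))  # editing path
print(model.generate("describe", "text"))                  # text path unchanged
```

The design choice this toy mirrors: because the two decoders share no parameters, adding image generation does not perturb the text pathway, which is why the abstract stresses that text generation capability is preserved.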
Problem

Research questions and friction points this paper is trying to address.

Existing models lack a unified framework covering text-to-image synthesis, image editing, and in-context generation
Adapting multimodal understanding models to VAE inputs for generation tends to degrade their text capabilities
No established benchmark existed for evaluating consistency in in-context (subject-driven) generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual decoding pathways for text and image with unshared parameters
Decoupled image tokenizer that preserves the base model's text generation
Reflection mechanism and curated reflection dataset for image generation
Data construction pipelines and the OmniContext benchmark for in-context generation
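The reflection mechanism for image generation can be sketched as a generate-critique-regenerate loop; the function names and the toy acceptance rule below are assumptions for illustration, not the paper's actual procedure.

```python
# Hypothetical sketch of a reflection loop for image generation: generate,
# self-critique against the prompt, then regenerate conditioned on the
# critique, until the model accepts its output or a step budget runs out.
# All names and the acceptance rule are illustrative, not OmniGen2's code.

def generate_image(prompt, feedback=None):
    """Stand-in for the image decoder; feedback conditions the next attempt."""
    return (prompt, feedback)

def reflect(prompt, image):
    """Stand-in for the model's self-critique of its own generation."""
    _, feedback = image
    # Toy rule: accept once one round of feedback has been incorporated.
    return None if feedback else "fix composition"

def generate_with_reflection(prompt, max_rounds=3):
    feedback = None
    image = None
    for _ in range(max_rounds):
        image = generate_image(prompt, feedback)
        feedback = reflect(prompt, image)
        if feedback is None:  # model judges its output acceptable
            break
    return image

result = generate_with_reflection("a red bicycle")
```

In the paper's framing, this loop is what the curated reflection dataset trains: examples pairing an initial generation, a critique, and an improved regeneration.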