DreamOmni: Unified Image Generation and Editing

📅 2024-12-22
🏛️ Computer Vision and Pattern Recognition
📈 Citations: 7
Influential: 0
🤖 AI Summary
Existing text-to-image (T2I) models achieve high-fidelity generation but lack unified modeling capabilities for downstream editing tasks—such as instruction-based and drag-based editing—and suffer from severe scarcity of high-quality, task-specific editing data. To address these limitations, we propose the first end-to-end unified framework jointly supporting T2I generation, instruction-driven editing, and drag-based editing. We design a sticker-based controllable synthetic data pipeline to alleviate the editing data bottleneck, and introduce joint contrastive distillation training alongside dual-modality editing modeling to harmonize generation and editing objectives. Extensive evaluations across multiple generation and editing benchmarks demonstrate that our method consistently outperforms specialized single-task models, achieving simultaneous improvements in generation quality and editing fidelity. The code and pretrained models will be publicly released.

📝 Abstract
Currently, the success of large language models (LLMs) illustrates that a unified multitasking approach can significantly enhance model usability, streamline deployment, and foster synergistic benefits across different tasks. However, in computer vision, while text-to-image (T2I) models have significantly improved generation quality through scaling up, their framework design did not initially consider how to unify with downstream tasks, such as various types of editing. To address this, we introduce DreamOmni, a unified model for image generation and editing. We begin by analyzing existing frameworks and the requirements of downstream tasks, proposing a unified framework that integrates both T2I models and various editing tasks. Furthermore, another key challenge is the efficient creation of high-quality editing data, particularly for instruction-based and drag-based editing. To this end, we develop a synthetic data pipeline using sticker-like elements to synthesize accurate, high-quality datasets efficiently, which enables scaling up of editing data for unified model training. For training, DreamOmni jointly trains T2I generation and downstream tasks. T2I training enhances the model's understanding of specific concepts and improves generation quality, while editing training helps the model grasp the nuances of the editing task. This collaboration significantly boosts editing performance. Extensive experiments confirm the effectiveness of DreamOmni. The code and model will be released.
Problem

Research questions and friction points this paper is trying to address.

Unifying image generation with diverse editing tasks in a single framework
Building an efficient pipeline for high-quality editing data
Jointly training generation and editing for synergistic benefits
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified model for image generation and editing
Synthetic data pipeline using sticker-like elements
Joint training of generation and editing tasks
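The sticker-based data pipeline is described only at a high level in the abstract. As a rough illustration of the idea (not the paper's actual implementation; all names are hypothetical and images are simplified to integer grids), compositing a transparent "sticker" onto a background image directly yields a (source, edited, instruction) training triplet with pixel-accurate ground truth:

```python
# Toy sketch of a sticker-style synthetic editing-data pipeline.
# Images are tiny integer grids; pasting a sticker onto a background
# produces a (source, target, instruction) training triplet.

def paste_sticker(background, sticker, top, left):
    """Return a copy of `background` with `sticker` composited at (top, left).

    Cells equal to 0 in the sticker are treated as transparent.
    """
    edited = [row[:] for row in background]  # deep copy, keep source intact
    for i, sticker_row in enumerate(sticker):
        for j, value in enumerate(sticker_row):
            if value != 0:
                edited[top + i][left + j] = value
    return edited

def make_edit_triplet(background, sticker, top, left, label):
    """Build one instruction-based editing example from a sticker paste."""
    edited = paste_sticker(background, sticker, top, left)
    instruction = f"add a {label} to the image"
    return {"source": background, "target": edited, "instruction": instruction}

bg = [[1] * 4 for _ in range(4)]   # plain 4x4 background
sticker = [[9, 9], [0, 9]]         # 2x2 sticker; 0 marks transparency
example = make_edit_triplet(bg, sticker, 1, 1, "sticker")
```

Because the edit is synthesized rather than annotated, the pipeline can generate large volumes of exact before/after pairs, which is the data-scaling property the abstract highlights.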