DreamOmni2: Multimodal Instruction-based Editing and Generation

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing image editing and generation methods suffer from two key limitations: text-only instruction-based editing struggles to model fine-grained modifications precisely, while subject-driven approaches are confined to concrete entities and cannot handle abstract concepts. This paper introduces multimodal instruction editing and generation—a novel task enabling joint text-and-image inputs to unify modeling of both concrete objects and abstract semantics. Our contributions include: (1) the first multimodal instruction framework supporting abstract concepts; (2) an index encoding and positional offset mechanism to effectively disambiguate multiple input images; and (3) a feature-mixing strategy coupled with an edit-extract model to synthesize high-quality training data, jointly optimized with a vision-language model (VLM). Evaluated on a newly constructed comprehensive benchmark, DreamOmni2 achieves significant improvements in complex instruction comprehension and multimodal editing/generation fidelity.

📝 Abstract
Recent advancements in instruction-based image editing and subject-driven generation have garnered significant attention, yet both tasks still face limitations in meeting practical user needs. Instruction-based editing relies solely on language instructions, which often fail to capture specific editing details, making reference images necessary. Meanwhile, subject-driven generation is limited to combining concrete objects or people, overlooking broader, abstract concepts. To address these challenges, we propose two novel tasks: multimodal instruction-based editing and generation. These tasks support both text and image instructions and extend the scope to include both concrete and abstract concepts, greatly enhancing their practical applications. We introduce DreamOmni2, tackling two primary challenges: data creation and model framework design. Our data synthesis pipeline consists of three steps: (1) using a feature mixing method to create extraction data for both abstract and concrete concepts, (2) generating multimodal instruction-based editing training data using the editing and extraction models, and (3) further applying the extraction model to create training data for multimodal instruction-based generation. For the framework, to handle multi-image input, we propose an index encoding and position encoding shift scheme, which helps the model distinguish images and avoid pixel confusion. Additionally, we introduce joint training with a VLM and our generation/editing model to better process complex instructions. We also propose comprehensive benchmarks for these two new tasks to drive their development. Experiments show that DreamOmni2 achieves impressive results. Models and code will be released.
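The three-step data synthesis pipeline described above can be sketched as a simple orchestration flow. This is a purely illustrative stub, assuming nothing about the paper's actual models or APIs; every function name and data shape here is a hypothetical stand-in:

```python
# Illustrative-only sketch of the three-step data synthesis pipeline.
# The "models" are stubs; the real pipeline uses trained editing and
# extraction models, which are not reproduced here.

def feature_mix_extract(source_images):
    """Step 1 stub: create concept-extraction pairs via feature mixing."""
    return [{"image": img, "concept": f"concept_of_{img}"} for img in source_images]

def make_editing_data(extraction_pairs):
    """Step 2 stub: build multimodal editing samples with edit + extract models."""
    return [{"instruction": f"apply {p['concept']}", "reference": p["image"]}
            for p in extraction_pairs]

def make_generation_data(extraction_pairs):
    """Step 3 stub: build multimodal generation samples via the extraction model."""
    return [{"prompt": f"generate with {p['concept']}", "reference": p["image"]}
            for p in extraction_pairs]

pairs = feature_mix_extract(["imgA", "imgB"])
edit_data = make_editing_data(pairs)
gen_data = make_generation_data(pairs)
```

The point of the staging is that step 1's extraction pairs feed both downstream steps, so one feature-mixing pass seeds training data for editing and generation alike.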
Problem

Research questions and friction points this paper is trying to address.

Addresses limitations in instruction-based image editing with multimodal inputs
Extends subject-driven generation to include abstract concepts
Proposes data synthesis and model framework for multimodal editing tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feature mixing method for concept extraction data
Index encoding and position encoding shift for multi-image input
Joint training with VLM and generation model
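The multi-image scheme above can be sketched in a few lines: tokens from each reference image get an image-index id plus position ids shifted by a per-image offset, so token ranges from different images never collide. This is a minimal sketch of the general idea, assuming flattened image tokens; the function name, offset value, and encoding details are illustrative, not the paper's implementation:

```python
# Hypothetical sketch of index encoding + position encoding shift for
# multi-image input. Not the paper's actual code.

def encode_multi_image_positions(image_token_counts, offset=1024):
    """Return (index_ids, position_ids) for concatenated image tokens.

    image_token_counts: number of tokens per input image, in order.
    offset: per-image positional shift keeping position ranges disjoint.
    """
    index_ids, position_ids = [], []
    for img_idx, n_tokens in enumerate(image_token_counts):
        index_ids.extend([img_idx] * n_tokens)  # which image each token came from
        position_ids.extend(img_idx * offset + p for p in range(n_tokens))
    return index_ids, position_ids

idx, pos = encode_multi_image_positions([3, 2], offset=10)
# idx -> [0, 0, 0, 1, 1]; pos -> [0, 1, 2, 10, 11]
```

With distinct index ids and non-overlapping position ranges, attention over the concatenated tokens can tell the reference images apart, which is what the abstract means by avoiding "pixel confusion" between inputs.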