🤖 AI Summary
Ambiguous or underspecified visual editing instructions often lead to inaccurate intent understanding and inconsistent editing outcomes, most notably misalignment between generated results and the reference images or scenes. To address this, the paper proposes a reflective cross-modal reasoning framework and introduces Reflection-Aware KL-Divergence Target Optimization (RKTO), presented as the first method to align model preferences with human intent without requiring binary supervision. The framework integrates chain-of-thought (CoT) reasoning, RKTO-based optimization, and multimodal alignment modeling, and is trained on 30,000 instruction–output pairs annotated with human-rated rationale quality. Evaluated across image, video, 3D, and 4D editing tasks, the approach significantly improves instruction adherence and generation fidelity, and human evaluations confirm greater precision and contextual awareness in the edited outputs. This work establishes a scalable, intent-driven paradigm for complex vision-language collaborative editing.
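The summary does not spell out the RKTO objective, so the sketch below is intuition only: a minimal PyTorch example of a rating-weighted, KTO-style loss in which graded human rationale scores replace binary desirable/undesirable labels, and each rationale's policy-to-reference log-ratio is pushed above or below a batch-level KL baseline according to its rating. The function name, the rating-to-weight mapping, and the baseline choice are illustrative assumptions, not the paper's exact RKTO formulation.

```python
# Minimal sketch of a rating-weighted, KTO-style alignment loss (one plausible
# reading of RKTO). Names and the rating-to-target mapping are assumptions,
# not the paper's actual formulation.
import torch

def rkto_loss(policy_logprobs, ref_logprobs, ratings, beta=0.1):
    """
    policy_logprobs: (B,) summed log-probs of each CoT rationale under the policy.
    ref_logprobs:    (B,) the same sequences scored by a frozen reference model.
    ratings:         (B,) graded human rationale-quality scores in [0, 1], not binary.
    """
    # Log-ratio of policy to reference (the implicit reward).
    log_ratio = policy_logprobs - ref_logprobs              # (B,)

    # Batch-level KL reference point, detached so it acts as a fixed baseline.
    kl_ref = log_ratio.mean().detach()

    # Map graded ratings to a signed preference weight in [-1, 1]:
    # highly rated rationales are pushed above the baseline, poorly rated ones below.
    direction = 2.0 * ratings - 1.0                         # (B,)

    # Per-example value: how far the example already sits on its preferred side.
    value = torch.sigmoid(beta * direction * (log_ratio - kl_ref))

    # Weight each example by the strength of its rating (|direction|).
    return (direction.abs() * (1.0 - value)).mean()

# Toy usage with random scores standing in for model outputs.
policy = torch.randn(8, requires_grad=True)
reference = torch.randn(8)
ratings = torch.rand(8)
loss = rkto_loss(policy, reference, ratings)
loss.backward()
```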
📝 Abstract
Editing complex visual content from ambiguous or partially specified instructions remains a core challenge in vision-language modeling. Existing models can contextualize content but often fail to ground the user's underlying intent in the reference image or scene, leading to inconsistent or misaligned edits. We introduce the Editing Vision-Language Model (EVLM), a system that interprets ambiguous instructions in conjunction with reference visuals to produce precise, context-aware editing prompts. EVLM's key innovation is a reflective reasoning framework that translates subjective user intent into structured, actionable outputs by aligning with human-rated rationales through Reflection-Aware KL-Divergence Target Optimization (RKTO). By combining Chain-of-Thought (CoT) reasoning with RKTO alignment, EVLM captures fine-grained editing preferences without relying on binary supervision. Trained on a dataset of 30,000 CoT examples with human-annotated rationale quality, EVLM achieves substantial gains in alignment with human intent. Experiments across image, video, 3D, and 4D editing tasks show that EVLM generates coherent, high-quality editing prompts, providing a scalable foundation for multimodal editing and reasoning.
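As a purely illustrative sketch of the interface the abstract describes (ambiguous instruction plus reference visual in, CoT rationale and structured editing prompt out), the toy code below stands in for EVLM with hard-coded output; every class, field, and function name is a hypothetical placeholder, not the paper's actual API.

```python
# Illustrative stand-in for EVLM's input/output contract; all names are
# hypothetical placeholders, and the body is a toy rule, not the real model.
from dataclasses import dataclass
from typing import List

@dataclass
class EditingPrompt:
    target_region: str        # which part of the scene the edit applies to
    operation: str            # e.g. "recolor", "replace", "relight"
    attributes: List[str]     # fine-grained constraints inferred from the reference
    rationale: str            # chain-of-thought explanation of the inferred intent

def interpret_instruction(instruction: str, reference_description: str) -> EditingPrompt:
    """Toy stand-in: in the real system, a vision-language model trained with
    CoT reasoning and RKTO alignment would produce this structured prompt."""
    rationale = (
        f"The instruction '{instruction}' is ambiguous; grounding it in the "
        f"reference ({reference_description}) suggests a localized appearance edit."
    )
    return EditingPrompt(
        target_region="foreground subject",
        operation="recolor",
        attributes=["preserve lighting", "keep background unchanged"],
        rationale=rationale,
    )

prompt = interpret_instruction("make it warmer", "sunset portrait of a cyclist")
print(prompt.operation, prompt.attributes)
```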