RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing

📅 2025-12-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing instruction-driven image editing models degrade under instruction-visual complexity (IV-Complexity), where intricate instructions meet cluttered or low-fidelity images, making precise target localization and faithful edits difficult. To address this, the paper proposes a "Plan-Execute" two-stage framework: first, a vision-language planner performs stepwise reasoning and explicit region localization; second, a training-free attention-region injection mechanism enables parallel multi-region editing. The stated contributions are: (1) the first region-aligned reasoning and planning paradigm for instruction grounding; (2) GRPO-based reinforcement learning to make instruction parsing more robust; and (3) IV-Edit, the first benchmark targeting fine-grained grounding and knowledge-intensive editing. Experiments show that the method outperforms strong baselines under IV-Complex conditions, with gains in both region localization accuracy and edit fidelity.

📝 Abstract

Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: https://replan-iv-edit.github.io
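The GRPO step mentioned in the abstract can be illustrated with a minimal sketch of the group-relative advantage computation, which scores each sampled plan against the other samples drawn for the same instruction. The reward below is purely hypothetical: `format_reward` and the JSON plan layout with `bbox` and `hint` keys are assumptions made for illustration, not the paper's actual reward or output format.

```python
import json
from statistics import mean, pstdev


def format_reward(plan_text: str) -> float:
    """Hypothetical reward: 1.0 if the plan is a non-empty JSON list of
    objects that each carry a "bbox" and a "hint" key, else 0.0."""
    try:
        plan = json.loads(plan_text)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(plan, list) or not plan:
        return 0.0
    ok = all(isinstance(p, dict) and {"bbox", "hint"} <= p.keys() for p in plan)
    return 1.0 if ok else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt:
    advantage_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled planner outputs for one instruction (toy data).
samples = [
    '[{"bbox": [0, 0, 4, 4], "hint": "make the mug red"}]',
    "not json",
    "[]",
    '[{"bbox": [1, 1, 2, 2], "hint": "remove logo"}]',
]
rewards = [format_reward(s) for s in samples]
advantages = group_relative_advantages(rewards)
```

Well-formed plans receive positive advantages and malformed ones negative, so no separate value network is needed to provide a baseline. When every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.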

Problem

Research questions and friction points this paper is trying to address.

Precise multi-region image editing from complex instructions

Instruction-visual complexity in cluttered or ambiguous scenes

Regional precision and fidelity without iterative inpainting

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-language planner decomposes instructions via reasoning

Training-free attention-region injection enables precise multi-region edits

GRPO-based reinforcement learning enhances reasoning fidelity
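One way to picture region-restricted attention is a boolean mask that lets each region's edit hint attend only to the image patches inside that region's bounding box. The sketch below is an illustration of that general idea under stated assumptions, not the paper's injection mechanism: the function name `region_attention_mask`, the row-major latent grid, and the box convention are all hypothetical.

```python
from typing import List, Tuple

# (x0, y0, x1, y1) in latent-grid cells; x1 and y1 are exclusive (assumed convention).
Box = Tuple[int, int, int, int]


def region_attention_mask(grid_w: int, grid_h: int, boxes: List[Box]) -> List[List[bool]]:
    """Return mask[r][p], True when region hint r may attend to image patch p.

    Patches are indexed row-major: p = y * grid_w + x.
    """
    mask = []
    for x0, y0, x1, y1 in boxes:
        row = [
            (x0 <= p % grid_w < x1) and (y0 <= p // grid_w < y1)
            for p in range(grid_w * grid_h)
        ]
        mask.append(row)
    return mask


# Two non-overlapping regions on a 4x4 latent grid (toy data).
mask = region_attention_mask(4, 4, [(0, 0, 2, 2), (2, 2, 4, 4)])
top_left = [p for p, allowed in enumerate(mask[0]) if allowed]
bottom_right = [p for p, allowed in enumerate(mask[1]) if allowed]
```

Because each row of the mask is independent, several regions can be handled in a single pass, which is the intuition behind parallel multi-region edits without iterative inpainting.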
