AI Summary
Existing diffusion-based methods struggle with complex, indirect text instructions in text-guided image editing, suffering from identity loss, unintended edits, and reliance on hand-crafted masks. To address these limitations, we propose X-Planner, the first approach leveraging multimodal large language models (MLLMs) for explicit edit planning. Through chain-of-thought reasoning, X-Planner automatically decomposes intricate instructions into executable subtasks and jointly predicts edit types and pixel-accurate segmentation masks, enabling precise, localized, and identity-preserving edits. Crucially, it eliminates dependence on manual annotations by establishing an end-to-end automated data generation pipeline. Evaluated on both standard benchmarks and a newly constructed challenging editing benchmark, X-Planner achieves state-of-the-art edit accuracy and identity fidelity, significantly outperforming prior methods.
Abstract
Recent diffusion-based image editing methods have significantly advanced text-guided tasks but often struggle to interpret complex, indirect instructions. Moreover, current models frequently suffer from poor identity preservation and unintended edits, or rely heavily on manual masks. To address these challenges, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that effectively bridges user intent with editing model capabilities. X-Planner employs chain-of-thought reasoning to systematically decompose complex instructions into simpler, clearer sub-instructions. For each sub-instruction, X-Planner automatically generates precise edit types and segmentation masks, eliminating manual intervention and ensuring localized, identity-preserving edits. Additionally, we propose a novel automated pipeline for generating large-scale data to train X-Planner, which achieves state-of-the-art results on both existing benchmarks and our newly introduced complex editing benchmark.