An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing

📅 2025-08-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-image editing methods struggle with fine-grained, multi-turn iterative editing due to coarse instruction understanding, context drift, and the absence of intelligent visual feedback mechanisms. To address this, we propose RefineEdit-Agent—a training-free framework that introduces the first LLM–LVLM collaborative closed-loop editing paradigm: the LLM handles instruction parsing and hierarchical planning, while the LVLM performs scene understanding, edit execution, and visual feedback assessment. By synergistically integrating linguistic reasoning and visual perception, our approach achieves high-fidelity, context-preserving iterative refinement. Evaluated on our newly constructed LongBench-T2I-Edit benchmark, RefineEdit-Agent scores 3.67—significantly surpassing state-of-the-art methods—and achieves superior performance in complex instruction comprehension, edit fidelity, and cross-turn contextual consistency.

📝 Abstract
Despite the remarkable capabilities of text-to-image (T2I) generation models, real-world applications often demand fine-grained, iterative image editing that existing methods struggle to provide. Key challenges include granular instruction understanding, robust context preservation during modifications, and the lack of intelligent feedback mechanisms for iterative refinement. This paper introduces RefineEdit-Agent, a novel, training-free intelligent agent framework designed to address these limitations by enabling complex, iterative, and context-aware image editing. RefineEdit-Agent leverages the powerful planning capabilities of Large Language Models (LLMs) and the advanced visual understanding and evaluation prowess of Vision-Language Large Models (LVLMs) within a closed-loop system. Our framework comprises an LVLM-driven instruction parser and scene understanding module, a multi-level LLM-driven editing planner for goal decomposition, tool selection, and sequence generation, an iterative image editing module, and a crucial LVLM-driven feedback and evaluation loop. To rigorously evaluate RefineEdit-Agent, we propose LongBench-T2I-Edit, a new benchmark featuring 500 initial images with complex, multi-turn editing instructions across nine visual dimensions. Extensive experiments demonstrate that RefineEdit-Agent significantly outperforms state-of-the-art baselines, achieving an average score of 3.67 on LongBench-T2I-Edit, compared to 2.29 for Direct Re-Prompting, 2.91 for InstructPix2Pix, 3.16 for GLIGEN-based Edit, and 3.39 for ControlNet-XL. Ablation studies, human evaluations, and analyses of iterative refinement, backbone choices, tool usage, and robustness to instruction complexity further validate the efficacy of our agentic design in delivering superior edit fidelity and context preservation.
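The abstract describes a closed-loop pipeline: an LVLM parses the instruction, an LLM plans a sequence of edits, an editing module applies each step, and an LVLM evaluator scores the result, retrying failed steps. A minimal sketch of that control flow is below; all function names, the scoring scheme, and the list-based "image" stand-in are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of the RefineEdit-Agent closed loop described above.
# Every function here is a stand-in for an LLM/LVLM call in the real system.

def parse_instruction(instruction):
    """Stand-in for the LVLM instruction parser: split a multi-part
    request into atomic sub-goals."""
    return [goal.strip() for goal in instruction.split(" and ")]

def plan_edits(sub_goals):
    """Stand-in for the LLM planner: map each sub-goal to a tool call."""
    return [{"tool": "inpaint", "goal": g} for g in sub_goals]

def apply_edit(image, step):
    """Stand-in for the editing module: record the edit on the image
    (a list of applied edits here, rather than pixels)."""
    return image + [step["goal"]]

def evaluate(image, goal):
    """Stand-in for LVLM feedback: score how well the current image
    satisfies the goal (trivially 1.0 once the edit is recorded)."""
    return 1.0 if goal in image else 0.0

def refine_edit_agent(image, instruction, threshold=0.9, max_turns=3):
    """Closed loop: plan, edit, evaluate, and retry steps that score
    below the acceptance threshold, up to max_turns attempts each."""
    for step in plan_edits(parse_instruction(instruction)):
        for _ in range(max_turns):
            candidate = apply_edit(image, step)
            if evaluate(candidate, step["goal"]) >= threshold:
                image = candidate  # accept the edit, preserve context
                break
    return image

result = refine_edit_agent([], "add a red hat and brighten the sky")
print(result)  # ['add a red hat', 'brighten the sky']
```

The key structural point is that evaluation gates acceptance of each edit, which is what distinguishes the closed-loop design from single-shot re-prompting baselines.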
Problem

Research questions and friction points this paper is trying to address.

Enabling fine-grained iterative image editing with intelligent feedback
Addressing granular instruction understanding and context preservation challenges
Overcoming limitations in robust iterative refinement mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-LVLM closed-loop system for iterative editing
Multi-level planning with goal decomposition and tool selection
LVLM-driven feedback evaluation loop for refinement
Zihan Liang
Kunming University of Science and Technology
Jiahao Sun
Kunming University of Science and Technology
Haoran Ma
PhD Student, University of California, Los Angeles
Computer Systems · Software Engineering