🤖 AI Summary
To address the challenge of inferring users’ implicit intentions from complex image editing instructions, this paper proposes a novel paradigm that avoids joint fine-tuning of large language models (LLMs) and diffusion models. Methodologically, it leverages an LLM for multi-step reasoning to decompose ambiguous instructions into explicit, executable fine-grained editing operations; introduces an iterative update mechanism to construct a dynamic scene representation; and establishes CIEBench—the first benchmark tailored for reasoning-driven image editing. The core contribution lies in decoupling semantic understanding from generation: foundation models extract structured image representations, while the LLM performs intent reasoning to enable precise editing. Experiments demonstrate state-of-the-art performance: PSNR improves by 9.955 dB over prior methods on the SmartEdit Reasoning Scenario Set, and the method achieves significant gains across all metrics on CIEBench.
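The summary reports the gain in PSNR, which measures how faithfully the untouched regions of an image are preserved after editing. As a reference point, here is a minimal PSNR implementation (standard formula, not code from the paper), assuming 8-bit images:

```python
import numpy as np

def psnr(reference: np.ndarray, edited: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((reference.astype(np.float64) - edited.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)
```

For example, two 8-bit images whose pixels differ uniformly by 1 gray level have MSE = 1 and thus a PSNR of 20·log10(255) ≈ 48.13 dB; higher values mean the preserved regions are closer to the original.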
📝 Abstract
Existing image editing methods handle simple editing instructions well. To deal with complex editing instructions, they often need to jointly fine-tune the large language models (LLMs) and diffusion models (DMs), which incurs very high computational complexity and training cost. To address this issue, we propose a new method, called **C**omplex **I**mage **E**diting via **L**LM **R**easoning (CIELR), which converts a complex user instruction into a set of simple and explicit editing actions, eliminating the need to jointly fine-tune the large language models and diffusion models. Specifically, we first construct a structured semantic representation of the input image using foundation models. Then, we introduce an iterative update mechanism that progressively refines this representation, obtaining a fine-grained visual representation of the image scene. This allows us to perform complex and flexible image editing tasks. Extensive experiments on the SmartEdit Reasoning Scenario Set show that our method surpasses the previous state of the art by 9.955 dB in PSNR, indicating its superior preservation of regions that should remain unchanged. Because public datasets for complex image editing with reasoning contain few samples, we construct a benchmark named CIEBench, containing 86 image samples, together with a metric designed specifically for reasoning-based image editing. CIELR also outperforms previous methods on this benchmark. The code and dataset are available at [https://github.com/Jia-shao/Reasoning-Editing](https://github.com/Jia-shao/Reasoning-Editing).
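The abstract describes a pipeline in which an LLM decomposes an ambiguous instruction into explicit editing actions that are then applied one at a time to a structured scene representation. The sketch below illustrates only that execution loop; the data structures and operation names (`EditOp`, `SceneRepresentation`, the `remove`/`replace`/`recolor` actions) are hypothetical stand-ins, not the paper's actual interface, and the plan would in practice come from the LLM's multi-step reasoning:

```python
from dataclasses import dataclass, field

@dataclass
class EditOp:
    action: str      # hypothetical action vocabulary: "remove", "replace", "recolor"
    target: str      # object in the scene the op applies to
    value: str = ""  # new object name or attribute value, if any

@dataclass
class SceneRepresentation:
    # hypothetical structured semantic representation: object name -> attributes
    objects: dict = field(default_factory=dict)

    def apply(self, op: EditOp) -> None:
        # iterative update: each explicit op refines the representation in place
        if op.action == "remove":
            self.objects.pop(op.target, None)
        elif op.action == "replace":
            self.objects[op.value] = self.objects.pop(op.target, {})
        elif op.action == "recolor":
            self.objects.setdefault(op.target, {})["color"] = op.value

def execute_plan(scene: SceneRepresentation, plan: list[EditOp]) -> SceneRepresentation:
    # `plan` stands in for the LLM's decomposition of a complex instruction
    # (e.g. "make the dessert healthier" -> replace the cake with a fruit bowl)
    for op in plan:
        scene.apply(op)
    return scene
```

The point of the decoupling is visible here: once the instruction is reduced to explicit operations over a scene representation, no joint LLM-diffusion fine-tuning is needed; a frozen generator only has to render each simple edit.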