RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward

📅 2026-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches struggle to provide reliable and verifiable reward signals to guide multimodal large language models (MLLMs) in performing expert-level image editing tasks aligned with user instructions, particularly due to the subjective nature of creative edits. This work proposes the first reinforcement learning framework based on a general-purpose reward model, wherein an MLLM agent translates high-level semantic instructions into precise parameter adjustments within professional photo-editing software. The reward model dynamically generates image-specific, multimodal evaluation metrics, delivering interpretable scalar feedback. Evaluated on a newly curated dataset of 190,000 instruction–reasoning pairs, the proposed method significantly outperforms existing MLLMs and diffusion models in both semantic fidelity and perceptual quality, achieving executable and verifiable high-quality image editing.
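To make the agent's role concrete, here is a minimal sketch of what "translating a high-level instruction into precise parameter adjustments" could look like. The parameter schema (`exposure`, `contrast`), the scaling constants, and the example plan are illustrative assumptions, not RetouchIQ's actual interface.

```python
def apply_adjustments(pixels, params):
    """Apply Lightroom-style slider parameters to pixel values in [0, 1]."""
    out = []
    for p in pixels:
        # exposure: additive brightness shift, scaled from slider units
        p = p + params.get("exposure", 0.0) * 0.1
        # contrast: expand/compress values around the midtone 0.5
        mid = 0.5
        p = mid + (p - mid) * (1.0 + params.get("contrast", 0.0) * 0.01)
        out.append(min(1.0, max(0.0, p)))  # clamp to the valid range
    return out

# The MLLM agent would emit a plan like this (hypothetical) for an
# instruction such as "make it brighter and punchier":
agent_plan = {"exposure": 1.5, "contrast": 20.0}

print(apply_adjustments([0.2, 0.5, 0.8], agent_plan))
```

Because the agent's output is an explicit parameter dictionary rather than raw pixels, every edit is executable inside the editing software and verifiable after the fact.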

📝 Abstract
Recent advances in multimodal large language models (MLLMs) have shown great potential for extending vision-language reasoning to professional tool-based image editing, enabling intuitive and creative workflows. A promising direction is to use reinforcement learning (RL) to enable MLLMs to reason about and execute optimal tool-use plans within professional image-editing software. However, training remains challenging due to the lack of reliable, verifiable reward signals that can reflect the inherently subjective nature of creative editing. In this work, we introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a generalist reward model. RetouchIQ interprets user-specified editing intentions and generates corresponding, executable image adjustments, bridging high-level aesthetic goals with precise parameter control. To move beyond conventional, rule-based rewards that compute similarity against a fixed reference image using handcrafted metrics, we propose a generalist reward model: an RL-fine-tuned MLLM that evaluates edited results through a set of generated metrics on a case-by-case basis. The reward model then provides scalar feedback through multimodal reasoning, enabling reinforcement learning with high-quality, instruction-consistent gradients. We curate an extended dataset with 190k instruction–reasoning pairs and establish a new benchmark for instruction-based image editing. Experiments show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems. Our findings demonstrate the potential of generalist reward-driven MLLM agents as flexible, explainable, and executable assistants for professional image editing.
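The generalist-reward idea in the abstract can be sketched as follows: instead of one fixed handcrafted metric, a reward model emits image-specific criteria per case and combines them into a single scalar. Here the RL-fine-tuned MLLM is mocked by simple keyword-driven Python checks; the criteria, weights, and statistics are illustrative assumptions, not the paper's actual metrics.

```python
def mean(xs):
    return sum(xs) / len(xs)

def generate_metrics(instruction):
    """Stand-in for the reward MLLM: pick (weight, criterion) pairs per case."""
    metrics = []
    if "brighter" in instruction:
        # score rises as the edit's mean luminance exceeds the original's
        metrics.append((0.6, lambda src, edit:
                        min(1.0, max(0.0, (mean(edit) - mean(src)) * 5.0))))
    # a generic validity criterion applies to every case
    metrics.append((0.4, lambda src, edit:
                    1.0 if all(0.0 <= p <= 1.0 for p in edit) else 0.0))
    return metrics

def reward(instruction, src, edit):
    """Weighted average of the generated criteria -> one scalar in [0, 1]."""
    metrics = generate_metrics(instruction)
    total_w = sum(w for w, _ in metrics)
    return sum(w * fn(src, edit) for w, fn in metrics) / total_w

print(reward("make it brighter", [0.2, 0.4], [0.3, 0.5]))
```

The scalar output is what makes the signal usable for RL fine-tuning of the editing agent, while the per-criterion scores keep the feedback interpretable.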
Problem

Research questions and friction points this paper is trying to address.

image retouching
instruction-based editing
reward signal
multimodal large language models
subjective evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models
Reinforcement Learning
Generalist Reward Model
Instruction-Based Image Editing
Executable Image Retouching