MIRA: Multimodal Iterative Reasoning Agent for Image Editing

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion-based image editing models struggle to accurately interpret complex natural language instructions involving compositional relationships, contextual dependencies, or demonstrative pronouns, leading to semantic drift and intent misalignment. To address this, we propose MIRA—a lightweight, plug-and-play multimodal reasoning agent that implements an iterative "Perceive–Reason–Act" loop, progressively decomposing user instructions into atomic editing operations to emulate human-like, multi-turn interactive understanding. Trained via a two-stage pipeline of supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) on the 150K-sample MIRA-Editing dataset, MIRA integrates seamlessly with open-source editors (e.g., Flux.1-Kontext, Step1X-Edit) to enable closed-loop feedback. Experiments demonstrate substantial improvements in semantic fidelity and visual quality; our method matches or surpasses closed-source systems—including GPT-Image and Nano-Banana—on multiple challenging instruction-following benchmarks.

📝 Abstract
Instruction-guided image editing offers an intuitive way for users to edit images with natural language. However, diffusion-based editing models often struggle to accurately interpret complex user instructions, especially those involving compositional relationships, contextual cues, or referring expressions, leading to edits that drift semantically or fail to reflect the intended changes. We tackle this problem by proposing MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent that performs editing through an iterative perception-reasoning-action loop, effectively simulating multi-turn human-model interaction processes. Instead of issuing a single prompt or static plan, MIRA predicts atomic edit instructions step by step, using visual feedback to make its decisions. Our 150K multimodal tool-use dataset, MIRA-Editing, combined with a two-stage SFT + GRPO training pipeline, enables MIRA to perform reasoning and editing over complex editing instructions. When paired with open-source image editing models such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit, MIRA significantly improves both semantic consistency and perceptual quality, achieving performance comparable to or exceeding proprietary systems such as GPT-Image and Nano-Banana.
Problem

Research questions and friction points this paper is trying to address.

Addresses inaccurate interpretation of complex image editing instructions
Solves semantic drift in diffusion-based editing models
Improves handling of compositional relationships and contextual cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal reasoning agent for iterative image editing
Step-by-step atomic edit instructions with visual feedback
Two-stage training pipeline with multimodal tool-use dataset
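The iterative loop described above can be sketched in pseudocode-like Python. This is a minimal illustration of the perceive-reason-act pattern, not the authors' implementation: `perceive`, `reason`, and `act` are hypothetical stand-ins for a VLM-based observer, the reasoning agent, and a diffusion editor backend (e.g., Flux.1-Kontext) respectively.

```python
# Hypothetical sketch of an iterative perceive-reason-act editing loop.
# All function bodies are toy stand-ins; a real system would call a VLM
# and a diffusion editor at each step.

def perceive(image):
    # Stand-in: a VLM would describe the current state of the image.
    return f"description of {image}"

def reason(instruction, observation, history):
    # Stand-in: pick the next atomic edit not yet applied, or None when done.
    remaining = [step for step in instruction if step not in history]
    return remaining[0] if remaining else None

def act(image, atomic_edit):
    # Stand-in: a diffusion editor would apply the atomic edit to the image.
    return f"{image}+{atomic_edit}"

def mira_style_edit(image, instruction, max_steps=5):
    """Decompose `instruction` into atomic edits, using visual feedback each turn."""
    history = []
    for _ in range(max_steps):
        observation = perceive(image)
        atomic_edit = reason(instruction, observation, history)
        if atomic_edit is None:  # agent judges the instruction fulfilled
            break
        image = act(image, atomic_edit)
        history.append(atomic_edit)
    return image, history

result, steps = mira_style_edit("img", ["add a hat", "recolor the sky"])
```

The key design point is that each atomic edit is chosen only after re-perceiving the edited image, rather than committing to a static plan up front.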