The Promise of RL for Autoregressive Image Editing

📅 2025-08-01

🤖 AI Summary

Autoregressive image editing models suffer from poor generalization across diverse tasks and heavy reliance on large-scale annotated data. To address this, we propose EARL, a unified autoregressive editing framework that integrates reinforcement learning (RL) with a large multimodal language model (MLLM) verifier. EARL processes textual instructions and image tokens in a single sequence, and the work compares three training strategies within that framework: supervised fine-tuning, RL optimization, and chain-of-thought reasoning. RL combined with the MLLM verifier proves the most effective of the three, yielding high-fidelity, fine-grained edits while using much less training data. Experiments demonstrate that EARL performs competitively with strong baselines (e.g., InstructPix2Pix, MagicBrush) across text-driven editing, object replacement, and attribute modification tasks. It improves edit fidelity and instruction adherence by 12.6% and 9.3%, respectively, while reducing required training data by approximately 75%. The code, training data, and models are publicly released.

📝 Abstract

We explore three strategies to enhance performance on a wide range of image editing tasks: supervised fine-tuning (SFT), reinforcement learning (RL), and Chain-of-Thought (CoT) reasoning. In order to study all these components in one consistent framework, we adopt an autoregressive multimodal model that processes textual and visual tokens in a unified manner. We find RL combined with a large multi-modal LLM verifier to be the most effective of these strategies. As a result, we release EARL: Editing with Autoregression and RL, a strong RL-based image editing model that performs competitively on a diverse range of edits compared to strong baselines, despite using much less training data. Thus, EARL pushes the frontier of autoregressive multimodal models on image editing. We release our code, training data, and trained models at https://github.com/mair-lab/EARL.
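To make the "RL with an MLLM verifier" idea concrete, here is a minimal sketch of how verifier scores could be turned into a learning signal. The abstract does not name the RL algorithm, so the group-wise normalization below is an assumption chosen for illustration, and `toy_verifier`, `score_candidates`, and `group_advantages` are hypothetical names rather than functions from the EARL codebase. Strings stand in for images.

```python
from statistics import mean, pstdev
from typing import Callable, List


def score_candidates(
    instruction: str,
    source_image: str,
    candidates: List[str],
    verifier: Callable[[str, str, str], float],
) -> List[float]:
    """Ask the verifier to rate every candidate edit of the same source image."""
    return [verifier(instruction, source_image, c) for c in candidates]


def group_advantages(scores: List[float], eps: float = 1e-6) -> List[float]:
    """Center and scale scores within one group of candidates (assumed scheme).

    Candidates rated above the group mean get a positive advantage and
    would be reinforced; those below the mean get a negative one.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


def toy_verifier(instruction: str, source_image: str, edited_image: str) -> float:
    """Placeholder for an MLLM judge: counts instruction words found in the edit."""
    return float(sum(word in edited_image for word in instruction.split()))


instruction = "red hat"
source_image = "a man"
candidates = ["a man", "a man with a red hat", "a man with a hat"]

scores = score_candidates(instruction, source_image, candidates, toy_verifier)
advantages = group_advantages(scores)
```

In a real system the placeholder verifier would be replaced by a call to a multimodal LLM that inspects the source image, the instruction, and the edited image; the advantages would then weight the log-probabilities of the sampled image tokens in the policy update.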

Problem

Research questions and friction points this paper is trying to address.

Enhancing image editing with reinforcement learning and autoregression

Combining RL and multimodal LLM for effective image edits

Developing EARL for competitive performance with less training data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses reinforcement learning for image editing

Combines RL with multimodal LLM verifier

Autoregressive model processes text and visuals
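
The unified handling of text and visual tokens can be pictured as a single token sequence that a next-token predictor consumes. The sketch below is illustrative only: the special-token ids, the vocabulary offsets, the ordering of image and instruction, and the loss mask are all assumptions, not details taken from the paper.

```python
from typing import List, Tuple

# Hypothetical special tokens and vocabulary offsets (assumptions for illustration).
BOS, EOS = 0, 1          # begin / end of sequence
BOI, EOI = 2, 3          # begin / end of an image span
TEXT_OFFSET = 10         # text token ids are shifted into their own range
IMAGE_OFFSET = 1000      # discrete image codes are shifted into a separate range


def build_editing_sequence(
    text_ids: List[int],
    source_img_ids: List[int],
    target_img_ids: List[int],
) -> Tuple[List[int], List[int]]:
    """Pack source image, instruction, and edited image into one sequence.

    Returns the token ids plus a mask marking which positions belong to the
    edited image, i.e. the part an autoregressive model would be trained to emit.
    """
    prompt = (
        [BOS, BOI]
        + [IMAGE_OFFSET + t for t in source_img_ids]
        + [EOI]
        + [TEXT_OFFSET + t for t in text_ids]
    )
    target = [BOI] + [IMAGE_OFFSET + t for t in target_img_ids] + [EOI, EOS]
    tokens = prompt + target
    loss_mask = [0] * len(prompt) + [1] * len(target)
    return tokens, loss_mask


tokens, loss_mask = build_editing_sequence(
    text_ids=[5, 6, 7],
    source_img_ids=[0, 1, 2, 3],
    target_img_ids=[3, 2, 1, 0],
)
```

Because everything lives in one sequence, the same model and the same next-token objective can serve supervised fine-tuning, RL sampling, and optional chain-of-thought text placed before the output image.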
