Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation

📅 2026-03-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the optimization bias in reinforcement learning for image generation and editing, which often arises from hallucinations and noisy reward signals in existing reward models. To mitigate these issues, the authors propose the FIRM framework, which leverages a high-quality, task-specific dataset to train an 8B-parameter reward model and introduces a novel “Base-and-Bonus” reward strategy. This strategy integrates Consistency-Modulated Execution (CME) and Quality-Modulated Alignment (QMA) mechanisms to enhance alignment with human preferences. Additionally, the study establishes FIRM-Bench, the first dual-path evaluation benchmark tailored for both image editing and generation. Experimental results demonstrate that FIRM-Edit-8B and FIRM-Gen-8B significantly outperform current methods in human preference alignment, while FIRM-Qwen-Edit and FIRM-SD3.5 markedly reduce hallucinations and improve instruction following and output fidelity.


πŸ“ Abstract
Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, inherently misguiding the optimization process. In this paper, we present FIRM (Faithful Image Reward Modeling), a comprehensive framework that develops robust reward models to provide accurate and reliable guidance for faithful image generation and editing. First, we design tailored data curation pipelines to construct high-quality scoring datasets. Specifically, we evaluate editing using both execution and consistency, while generation is primarily assessed via instruction following. Using these pipelines, we collect the FIRM-Edit-370K and FIRM-Gen-293K datasets, and train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B) that accurately reflect these criteria. Second, we introduce FIRM-Bench, a comprehensive benchmark specifically designed for editing and generation critics. Evaluations demonstrate that our models achieve superior alignment with human judgment compared to existing metrics. Furthermore, to seamlessly integrate these critics into the RL pipeline, we formulate a novel "Base-and-Bonus" reward strategy that balances competing objectives: Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation. Empowered by this framework, our resulting models FIRM-Qwen-Edit and FIRM-SD3.5 achieve substantial performance breakthroughs. Comprehensive experiments demonstrate that FIRM mitigates hallucinations, establishing a new standard for fidelity and instruction adherence over existing general models. All of our datasets, models, and code are publicly available at https://firm-reward.github.io.
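The abstract describes the "Base-and-Bonus" strategy only at a high level. A minimal sketch of one plausible reading is given below; the function names, score names (`execution`, `consistency`, `alignment`, `quality`), and the multiplicative gating form are assumptions for illustration, not the paper's actual formulation. The idea sketched here: the primary criterion supplies a base reward, and the secondary criterion modulates a bonus term so that neither objective can be maximized at the other's expense.

```python
def cme_reward(execution: float, consistency: float, bonus_weight: float = 0.5) -> float:
    """Consistency-Modulated Execution (hypothetical formulation).

    Edit execution is the base reward; the bonus is gated by consistency,
    so the critic only grants extra reward to edits that also preserve
    unedited regions. All scores are assumed to lie in [0, 1].
    """
    return execution + bonus_weight * execution * consistency


def qma_reward(alignment: float, quality: float, bonus_weight: float = 0.5) -> float:
    """Quality-Modulated Alignment (hypothetical formulation).

    Instruction following (alignment) is the base reward; visual quality
    gates the bonus, so prompts are not satisfied at the cost of a
    degraded image.
    """
    return alignment + bonus_weight * alignment * quality


if __name__ == "__main__":
    # A well-executed but inconsistent edit earns only the base reward.
    print(cme_reward(execution=0.9, consistency=0.0))
    # High execution and high consistency earn base plus bonus.
    print(cme_reward(execution=0.9, consistency=1.0))
```

Under this reading, a zero secondary score collapses the reward to the base term rather than zeroing it out, which keeps the gradient signal on the primary objective while still penalizing unfaithful outputs.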
Problem

Research questions and friction points this paper is trying to address.

reward modeling
hallucination
image editing
text-to-image generation
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reward Modeling
Reinforcement Learning
Faithful Image Generation
Instruction Following
Hallucination Mitigation