Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation

📅 2026-03-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the optimization bias in reinforcement learning for image generation and editing, which often arises from hallucinations and noisy reward signals in existing reward models. To mitigate these issues, the authors propose the FIRM framework, which leverages a high-quality, task-specific dataset to train an 8B-parameter reward model and introduces a novel “Base-and-Bonus” reward strategy. This strategy integrates Consistency-Modulated Execution (CME) and Quality-Modulated Alignment (QMA) mechanisms to enhance alignment with human preferences. Additionally, the study establishes FIRM-Bench, the first dual-path evaluation benchmark tailored for both image editing and generation. Experimental results demonstrate that FIRM-Edit-8B and FIRM-Gen-8B significantly outperform current methods in human preference alignment, while FIRM-Qwen-Edit and FIRM-SD3.5 markedly reduce hallucinations and improve instruction following and output fidelity.


πŸ“ Abstract
Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, inherently misguiding the optimization process. In this paper, we present FIRM (Faithful Image Reward Modeling), a comprehensive framework that develops robust reward models to provide accurate and reliable guidance for faithful image generation and editing. First, we design tailored data curation pipelines to construct high-quality scoring datasets. Specifically, we evaluate editing using both execution and consistency, while generation is primarily assessed via instruction following. Using these pipelines, we collect the FIRM-Edit-370K and FIRM-Gen-293K datasets, and train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B) that accurately reflect these criteria. Second, we introduce FIRM-Bench, a comprehensive benchmark specifically designed for editing and generation critics. Evaluations demonstrate that our models achieve superior alignment with human judgment compared to existing metrics. Furthermore, to seamlessly integrate these critics into the RL pipeline, we formulate a novel "Base-and-Bonus" reward strategy that balances competing objectives: Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation. Empowered by this framework, our resulting models FIRM-Qwen-Edit and FIRM-SD3.5 achieve substantial performance breakthroughs. Comprehensive experiments demonstrate that FIRM mitigates hallucinations, establishing a new standard for fidelity and instruction adherence over existing general models. All of our datasets, models, and code are publicly available at https://firm-reward.github.io.
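The abstract describes the "Base-and-Bonus" strategy only at a high level. A minimal sketch of one plausible reading is given below; the function names, score names (`execution`, `consistency`, `alignment`, `quality`), and the multiplicative gating form are assumptions for illustration, not the paper's actual formulation. The idea sketched here: the primary criterion supplies a base reward, and the secondary criterion modulates a bonus term so that neither objective can be maximized at the other's expense.

```python
def cme_reward(execution: float, consistency: float, bonus_weight: float = 0.5) -> float:
    """Consistency-Modulated Execution (hypothetical formulation).

    Edit execution is the base reward; the bonus is gated by consistency,
    so the critic only grants extra reward to edits that also preserve
    unedited regions. All scores are assumed to lie in [0, 1].
    """
    return execution + bonus_weight * execution * consistency


def qma_reward(alignment: float, quality: float, bonus_weight: float = 0.5) -> float:
    """Quality-Modulated Alignment (hypothetical formulation).

    Instruction following (alignment) is the base reward; visual quality
    gates the bonus, so prompts are not satisfied at the cost of a
    degraded image.
    """
    return alignment + bonus_weight * alignment * quality


if __name__ == "__main__":
    # A well-executed but inconsistent edit earns only the base reward.
    print(cme_reward(execution=0.9, consistency=0.0))
    # High execution and high consistency earn base plus bonus.
    print(cme_reward(execution=0.9, consistency=1.0))
```

Under this reading, a zero secondary score collapses the reward to the base term rather than zeroing it out, which keeps the gradient signal on the primary objective while still penalizing unfaithful outputs.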
Problem

Research questions and friction points this paper is trying to address.

reward modeling
hallucination
image editing
text-to-image generation
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reward Modeling
Reinforcement Learning
Faithful Image Generation
Instruction Following
Hallucination Mitigation