🤖 AI Summary
Online reinforcement learning for image editing is hindered by the absence of fine-grained and reliable reward signals, with existing evaluators often suffering from "attention collapse" due to their neglect of cross-image comparisons and detail-aware perception. This work proposes the first explicit spatial reasoning–driven reward modeling approach, which aligns semantic judgments with spatial awareness by predicting edited regions and grounding them in pixel-level evidence. To support this framework, we construct a spatially aware training dataset comprising 260,000 samples and integrate it into an online reinforcement learning pipeline. Our method achieves state-of-the-art performance across multiple benchmarks, including MMRB2, EditReward-Bench, and MultiEditReward-Bench. When deployed as a reward signal, it improves OmniGen2 by 0.90 points on GEdit-Bench, surpassing the leading discriminative model and doubling the gain achieved by GPT-4.1.
📝 Abstract
Online Reinforcement Learning (RL) offers a promising avenue for complex image editing but is currently constrained by the scarcity of reliable and fine-grained reward signals. Existing evaluators frequently struggle with a critical perception gap we term "Attention Collapse," where models neglect cross-image comparisons and fail to capture fine-grained details, resulting in inaccurate perception and miscalibrated scores. To address these limitations, we propose SpatialReward, a reward model that enforces precise verification via explicit spatial reasoning. By anchoring reasoning to predicted edit regions, SpatialReward grounds semantic judgments in pixel-level evidence, significantly enhancing evaluative accuracy. Trained on a curated 260k spatial-aware dataset, our model achieves state-of-the-art performance on MMRB2 and EditReward-Bench, and outperforms proprietary evaluators on our proposed MultiEditReward-Bench. Furthermore, SpatialReward serves as a robust signal in online RL, boosting OmniGen2 by +0.90 on GEdit-Bench, surpassing the leading discriminative model and doubling the gain of GPT-4.1 (+0.45). These results demonstrate that spatial reasoning is essential for unlocking effective alignment in image editing.