Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks for image editing and reward model evaluation suffer from tasks that are too easy for strong models, coarse-grained assessment, and evaluation settings that deviate from practical reinforcement learning (RL) scenarios. To address these limitations, this work proposes a unified evaluation suite that covers both image editing and reward modeling. Edit-Compass contains 2,388 annotated instances spanning six progressively challenging task categories, scored with a fine-grained, multidimensional protocol based on structured reasoning and scoring rubrics. EditReward-Compass contains 2,251 preference pairs constructed to simulate the reward modeling scenarios that arise during RL optimization. The suite is intended to make evaluation more faithful to human judgment and more discriminative for both image editing models and reward models.

📝 Abstract

Recent image editing models have achieved remarkable progress in instruction following, multimodal understanding, and complex visual editing. However, existing benchmarks often fail to faithfully reflect human judgment, especially for strong frontier models, due to limited task difficulty and coarse-grained evaluation protocols. In parallel, reward models have become increasingly important for RL-based image editing optimization, yet existing reward model benchmarks still rely on unrealistic evaluation settings that deviate from practical RL scenarios. These limitations hinder reliable assessment of both image editing models and reward models. To address these challenges, we introduce Edit-Compass and EditReward-Compass, a unified evaluation suite for image editing and reward modeling. Edit-Compass contains 2,388 carefully annotated instances spanning six progressively challenging task categories, covering capabilities such as world knowledge reasoning, visual reasoning, and multi-image editing. Beyond broad task coverage, Edit-Compass adopts a fine-grained multidimensional evaluation framework based on structured reasoning and carefully designed scoring rubrics. In parallel, EditReward-Compass contains 2,251 preference pairs that simulate realistic reward modeling scenarios during RL optimization.
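As a rough illustration of how a reward model might be scored against preference pairs like those in EditReward-Compass, the sketch below computes pairwise accuracy: the fraction of pairs in which the model scores the preferred edit strictly higher than the rejected one. This is a common metric for preference benchmarks and is not confirmed by the abstract as the metric the authors use. The `PreferencePair` fields, the `score` callable, and the toy scores are all hypothetical and do not reflect the released data format.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical record layout; the real dataset schema may differ."""
    instruction: str  # the editing instruction given to the models
    chosen: str       # identifier of the edit that annotators preferred
    rejected: str     # identifier of the edit that annotators rejected


def pairwise_accuracy(
    pairs: Sequence[PreferencePair],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the chosen edit gets a strictly higher score.

    Ties are counted as errors, which is an assumption of this sketch.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = sum(
        score(p.instruction, p.chosen) > score(p.instruction, p.rejected)
        for p in pairs
    )
    return correct / len(pairs)


# Toy stand-in for a reward model: a fixed lookup table of scores.
toy_scores = {
    "a_good": 0.9, "a_bad": 0.2,  # ranked correctly
    "b_good": 0.4, "b_bad": 0.7,  # ranked incorrectly
    "c_good": 0.5, "c_bad": 0.5,  # tie, counted as an error here
}
pairs = [
    PreferencePair("make the sky stormy", "a_good", "a_bad"),
    PreferencePair("remove the car", "b_good", "b_bad"),
    PreferencePair("add a red hat", "c_good", "c_bad"),
]
accuracy = pairwise_accuracy(pairs, lambda instruction, image: toy_scores[image])
print(f"pairwise accuracy: {accuracy:.3f}")
```

In practice the `score` callable would wrap a reward model that takes the source image, the instruction, and the edited image; the lookup table above only keeps the example self-contained.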

Problem

Research questions and friction points this paper is trying to address.

image editing

reward modeling

evaluation benchmark

reinforcement learning

human judgment

Innovation

Methods, ideas, or system contributions that make the work stand out.

image editing benchmark

reward modeling

fine-grained evaluation