Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large multimodal models struggle with understanding complex instructions, preserving appearance consistency, and adapting to diverse input formats in general-purpose visual editing. To address this, we introduce RISEBench, the first benchmark explicitly designed for reasoning-informed visual editing, systematically covering temporal, causal, spatial, and logical reasoning capabilities. It evaluates models across three dimensions: instruction reasoning, appearance consistency, and visual plausibility. We propose a dual-track evaluation framework that pairs human judges with an LMM-as-a-judge, built on structured test cases and multidimensional metrics. Experiments reveal that even GPT-4o-Native, the top-performing model overall, achieves less than 40% accuracy on logical-reasoning editing tasks. RISEBench is publicly released to foster research on interpretable, reasoning-capable visual editing.

📝 Abstract
Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To address this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning types: Temporal, Causal, Spatial, and Logical Reasoning. We curate high-quality test cases for each category and propose an evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and an LMM-as-a-judge approach. Our experiments reveal that while GPT-4o-Native significantly outperforms other open-source and proprietary models, even this state-of-the-art system struggles with logical reasoning tasks, highlighting an area that remains underexplored. As an initial effort, RISEBench aims to provide foundational insights into reasoning-aware visual editing and to catalyze future research. Though still in its early stages, we are committed to continuously expanding and refining the benchmark to support more comprehensive, reliable, and scalable evaluations of next-generation multimodal systems. Our code and data will be released at https://github.com/PhoenixZ810/RISEBench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating Reasoning-Informed Visual Editing (RISE) in Large Multi-modality Models (LMMs)
Assessing four key reasoning types: Temporal, Causal, Spatial, and Logical Reasoning
Addressing challenges in complex instruction following, appearance consistency, and flexible input formats
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces RISEBench, the first benchmark for reasoning-informed visual editing
Curates test cases covering four reasoning types: temporal, causal, spatial, and logical
Combines human judges with an LMM-as-a-judge evaluation track (a minimal sketch follows below)
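The automated track of the dual-track evaluation asks a vision-capable LMM to grade each edit on the three RISE dimensions. A minimal sketch of what such a judge call might look like is below; `call_judge`, the prompt wording, and the 1–5 scale are illustrative assumptions, not the paper's exact rubric or API.

```python
# Hypothetical LMM-as-a-judge scorer in the style of RISEBench's three
# evaluation dimensions. `call_judge` stands in for any vision-capable
# LMM endpoint that accepts a text prompt plus images and returns text.
import json
from typing import Callable

DIMENSIONS = ("instruction_reasoning", "appearance_consistency", "visual_plausibility")

# Illustrative grading prompt; the paper's actual rubric may differ.
JUDGE_PROMPT = """You are grading a visual edit.
Instruction: {instruction}
The first image is the input; the second is the edited result.
Rate each dimension from 1 (poor) to 5 (excellent) and reply as JSON:
{{"instruction_reasoning": int, "appearance_consistency": int, "visual_plausibility": int}}"""

def score_edit(
    instruction: str,
    input_image: bytes,
    edited_image: bytes,
    call_judge: Callable[[str, list[bytes]], str],
) -> dict[str, int]:
    """Ask a judge LMM to score one test case on the three dimensions."""
    reply = call_judge(
        JUDGE_PROMPT.format(instruction=instruction),
        [input_image, edited_image],
    )
    scores = json.loads(reply)
    # Keep only the expected keys and clamp each score to the 1-5 range.
    return {d: max(1, min(5, int(scores[d]))) for d in DIMENSIONS}
```

Keeping the judge call behind a plain callable makes the scoring logic model-agnostic, so the same rubric and parsing can be reused when human judges fill in the scores instead.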