🤖 AI Summary
Existing unified video models exhibit strong understanding and generation capabilities but struggle with physics- and causality-aware visual editing, primarily due to the absence of reasoning-oriented evaluation benchmarks and an inherent decoupling between model understanding and editing functionality.
Method: We introduce Reason-Informed Video Editing (RVE), a novel task requiring explicit modeling of physical plausibility and causal dynamics during editing. We construct RVE-Bench, the first benchmark dedicated to this task, and propose the Self-Reflective Reasoning (SRF) framework, which integrates an internal vision-language model (VLM) as a reasoning evaluator to provide closed-loop feedback for generator refinement.
Contribution/Results: Our approach achieves deep coupling between reasoning and editing, establishing the first systematic definition and evaluation of reasoning-aware video editing. On the reasoning-guided editing subset of RVE-Bench, SRF improves overall performance by 32%, significantly enhancing editing accuracy and visual fidelity over state-of-the-art methods.
📝 Abstract
Unified video models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing even when equipped with powerful internal vision-language models (VLMs). We attribute this gap to two factors: 1) existing datasets are inadequate for training and evaluating reasoning-aware video editing, and 2) an inherent disconnect between a model's reasoning and editing capabilities prevents its rich understanding from effectively guiding the editing process. Bridging this gap requires an integrated framework that connects reasoning with visual transformation. To this end, we introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. To support systematic evaluation, we construct RVE-Bench, a comprehensive benchmark with two complementary subsets, Reasoning-Informed Video Editing and In-Context Video Generation, which together cover diverse reasoning dimensions and real-world editing scenarios. Building on this foundation, we propose ReViSE, a Self-Reflective Reasoning (SRF) framework that unifies generation and evaluation within a single architecture. The model's internal VLM provides intrinsic feedback by assessing whether the edited video logically satisfies the given instruction, and this differentiable feedback refines the generator's reasoning behavior during training. Extensive experiments on RVE-Bench demonstrate that ReViSE significantly improves editing accuracy and visual fidelity, achieving a 32% improvement in the Overall score on the reasoning-informed video editing subset over state-of-the-art methods.
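The closed-loop idea in the abstract, an internal evaluator scoring the edit and feeding that judgment back to the generator, can be sketched in miniature. Everything below is illustrative: `vlm_score`, `refine`, and the scalar stand-in for a video are assumptions made for exposition, not the paper's actual API or training objective (ReViSE uses differentiable feedback during training, whereas this toy uses a discrete search).

```python
def vlm_score(video, target):
    """Toy stand-in for the internal VLM evaluator: higher means the
    edited "video" (a scalar here) better satisfies the instruction,
    which is represented by a target value."""
    return -abs(video - target)

def refine(video, target, steps=50, step_size=0.1):
    """Self-reflective loop: propose small candidate edits, ask the
    evaluator which one best satisfies the instruction, and keep it."""
    for _ in range(steps):
        candidates = (video, video + step_size, video - step_size)
        video = max(candidates, key=lambda v: vlm_score(v, target))
    return video

# The evaluator's feedback steers the "generator" toward the target edit.
print(abs(refine(0.0, 2.0) - 2.0) < 1e-6)  # True
```

The point of the sketch is only the generate-evaluate-refine cycle; in the actual framework the evaluator's signal updates the generator's parameters rather than searching over outputs at inference time.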