🤖 AI Summary
Existing unified video models exhibit strong understanding and generation capabilities but struggle with physics- and causality-aware visual editing, primarily due to the absence of reasoning-oriented evaluation benchmarks and an inherent decoupling between model understanding and editing functionality.
Method: We introduce Reason-Informed Video Editing (RVE), a novel task requiring explicit modeling of physical plausibility and causal dynamics during editing. We construct RVE-Bench, the first benchmark dedicated to this task, and propose the Self-Reflective Reasoning (SRF) framework, which integrates an internal vision-language model (VLM) as a reasoning evaluator to provide closed-loop feedback for generator refinement.
Contribution/Results: Our approach achieves deep coupling between reasoning and editing, establishing the first systematic definition and evaluation of reasoning-aware video editing. On the reasoning-guided editing subset of RVE-Bench, SRF improves overall performance by 32%, significantly enhancing editing accuracy and visual fidelity over state-of-the-art methods.
📝 Abstract
Unified video models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing even when equipped with powerful internal vision-language models (VLMs). We attribute this gap to two factors: 1) existing datasets are inadequate for training and evaluating reasoning-aware video editing, and 2) an inherent disconnect between a model's reasoning and editing capabilities prevents its rich understanding from effectively guiding the editing process. Bridging this gap requires an integrated framework that connects reasoning with visual transformation. To this end, we introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. To support systematic evaluation, we construct RVE-Bench, a comprehensive benchmark with two complementary subsets, Reasoning-Informed Video Editing and In-Context Video Generation, which together cover diverse reasoning dimensions and real-world editing scenarios. Building on this foundation, we propose ReViSE, a Self-Reflective Reasoning (SRF) framework that unifies generation and evaluation within a single architecture. The model's internal VLM provides intrinsic feedback by assessing whether the edited video logically satisfies the given instruction, and this differentiable feedback refines the generator's reasoning behavior during training. Extensive experiments on RVE-Bench demonstrate that ReViSE significantly improves editing accuracy and visual fidelity, achieving a 32% improvement in the Overall score on the reasoning-informed video editing subset over state-of-the-art methods.
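The closed-loop idea in the abstract, an internal evaluator scoring the edit and feeding that judgment back to the generator, can be sketched in miniature. Everything below is illustrative: `vlm_score`, `refine`, and the scalar stand-in for a video are assumptions made for exposition, not the paper's actual API or training objective (ReViSE uses differentiable feedback during training, whereas this toy uses a discrete search).

```python
def vlm_score(video, target):
    """Toy stand-in for the internal VLM evaluator: higher means the
    edited "video" (a scalar here) better satisfies the instruction,
    which is represented by a target value."""
    return -abs(video - target)

def refine(video, target, steps=50, step_size=0.1):
    """Self-reflective loop: propose small candidate edits, ask the
    evaluator which one best satisfies the instruction, and keep it."""
    for _ in range(steps):
        candidates = (video, video + step_size, video - step_size)
        video = max(candidates, key=lambda v: vlm_score(v, target))
    return video

# The evaluator's feedback steers the "generator" toward the target edit.
print(abs(refine(0.0, 2.0) - 2.0) < 1e-6)  # True
```

The point of the sketch is only the generate-evaluate-refine cycle; in the actual framework the evaluator's signal updates the generator's parameters rather than searching over outputs at inference time.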