VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion-based instruction-guided video editing methods generalize poorly to complex, realistic instructions because they rely on simplistic paired training data. To address this, we propose a Vision-Language Model (VLM)-guided instruction encoding mechanism that maps natural-language instructions, grounded in the source video, into spatially and semantically precise editing signals. We further design Edit-GRPO, a post-training stage that adapts Group Relative Policy Optimization to video editing, directly optimizing the policy for instruction-faithful, content-preserving, and aesthetically pleasing edits. Additionally, we introduce a synthetic paired-data generation pipeline built from primitive editing operations to mitigate the scarcity of real human annotations. Our method adopts a diffusion Transformer architecture, jointly optimizing content fidelity and inter-frame temporal coherence. Extensive evaluation demonstrates state-of-the-art performance across multi-task generalization, instruction adherence, and editing quality, with particular gains in accuracy and visual naturalness for complex semantic edits.

📝 Abstract
Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video-instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods. Website: https://viva-paper.github.io
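The abstract describes Edit-GRPO as adapting Group Relative Policy Optimization to video editing using relative rewards: several candidate edits are sampled per instruction, each is scored by a reward model, and each candidate's advantage is its reward standardized within the group. A minimal sketch of that group-relative advantage computation (function name and reward values are illustrative, not from the paper):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within a sampled group, GRPO-style.

    Each candidate edit's advantage is its reward minus the group
    mean, divided by the group standard deviation, so the policy is
    pushed toward edits that beat their siblings rather than toward
    an absolute reward scale. Illustrative sketch only; the paper's
    exact reward terms are not reproduced here.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical scores for four candidate edits of one instruction,
# e.g. from a reward model combining instruction fidelity,
# content preservation, and aesthetics.
rewards = [0.62, 0.80, 0.41, 0.77]
advantages = group_relative_advantages(rewards)
```

Because the advantages are zero-mean within each group, below-average candidates receive negative learning signal, which is what lets a relative-reward scheme work without a learned value baseline.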
Problem

Research questions and friction points this paper is trying to address.

Generalizing to complex, real-world editing instructions beyond simple paired training data
Preserving content fidelity and temporal coherence during video edits
Aligning edits with instruction-following and aesthetic-quality objectives via reward-based training
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLM-guided instruction encoding for spatial-semantic context
Post-training reward optimization for instruction-faithful edits
Synthetic data pipeline for diverse video-instruction pairs
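The third contribution is a pipeline that synthesizes paired video-instruction data from basic editing operations. A hypothetical sketch of how such primitives could be templated into (instruction, edit specification) pairs; all operation names and templates below are illustrative assumptions, not the paper's actual taxonomy:

```python
# Illustrative primitive editing operations mapped to instruction
# templates. The real pipeline's operations and phrasing may differ.
PRIMITIVES = {
    "add":     "Add {object} to the scene",
    "remove":  "Remove the {object}",
    "replace": "Replace the {object} with {target}",
    "style":   "Restyle the video as {target}",
}

def make_pair(op, obj, target=None):
    """Return (instruction, edit_spec) for one synthetic training pair."""
    instruction = PRIMITIVES[op].format(object=obj, target=target)
    edit_spec = {"op": op, "object": obj, "target": target}
    return instruction, edit_spec

pair = make_pair("replace", "red car", "blue truck")
```

Pairing each instruction with a machine-readable edit spec is what lets the generated videos stay faithful to the instruction by construction, which is the property the paper exploits to sidestep scarce human annotations.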