IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video editing benchmarks suffer from insufficient source video diversity, narrow task coverage, and unidimensional evaluation criteria, hindering systematic assessment of instruction-guided video editing. To address this, we introduce IVEBench, the first modern benchmark tailored to this task, comprising 600 high-quality source videos and 35 fine-grained editing tasks across eight broad categories, enabling evaluation of complex semantic understanding and multi-step instruction following. We propose a three-dimensional evaluation protocol (video quality, instruction compliance, and video fidelity) that integrates traditional metrics with automated scoring from multimodal large language models (e.g., Video-LLaVA), achieving strong alignment with human judgments (Spearman's ρ > 0.89). Instructions are synthetically generated by LLMs and rigorously validated by domain experts to ensure semantic accuracy and task feasibility. Extensive experiments demonstrate that IVEBench effectively discriminates among state-of-the-art methods, substantially improving the systematicity, reliability, and generalizability of video editing evaluation.
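The human-alignment figure above (Spearman's ρ > 0.89) measures rank agreement between automated MLLM scores and human ratings. A minimal, self-contained sketch of how such a correlation is computed follows; the score vectors are made-up placeholders, not data from the paper.

```python
# Illustrative sketch (not the paper's code): quantifying agreement between
# an MLLM judge's scores and human ratings via Spearman's rank correlation.
# All scores below are hypothetical placeholders.

def ranks(values):
    """Return the rank (1 = smallest) of each value; ties get averaged ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank over the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-video scores from an MLLM judge and from human raters.
mllm_scores = [4.2, 3.1, 4.8, 2.5, 3.9, 4.5]
human_scores = [4.0, 3.3, 4.9, 2.2, 3.7, 4.6]
print(f"Spearman's rho = {spearman_rho(mllm_scores, human_scores):.3f}")
```

Because Spearman's ρ depends only on ranks, it rewards a judge that orders edited videos the same way humans do, even if the absolute score scales differ.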

📝 Abstract
Instruction-guided video editing has emerged as a rapidly advancing research direction, offering new opportunities for intuitive content transformation while also posing significant challenges for systematic evaluation. Existing video editing benchmarks fail to adequately support evaluation of instruction-guided video editing, and further suffer from limited source diversity, narrow task coverage, and incomplete evaluation metrics. To address the above limitations, we introduce IVEBench, a modern benchmark suite specifically designed for instruction-guided video editing assessment. IVEBench comprises a diverse database of 600 high-quality source videos spanning seven semantic dimensions and covering video lengths ranging from 32 to 1,024 frames. It further includes 8 categories of editing tasks with 35 subcategories, whose prompts are generated and refined through large language models and expert review. Crucially, IVEBench establishes a three-dimensional evaluation protocol encompassing video quality, instruction compliance, and video fidelity, integrating both traditional metrics and multimodal large language model-based assessments. Extensive experiments demonstrate the effectiveness of IVEBench in benchmarking state-of-the-art instruction-guided video editing methods, showing its ability to provide comprehensive and human-aligned evaluation outcomes.
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks inadequately evaluate instruction-guided video editing
Current benchmarks have limited source diversity and task coverage
Existing evaluation metrics for video editing are incomplete
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces IVEBench benchmark suite for video editing assessment
Includes diverse video database with seven semantic dimensions
Establishes three-dimensional evaluation protocol with multimodal metrics
Authors

Yinan Chen, Zhejiang University
Jiangning Zhang, Tencent Youtu Lab
Teng Hu, Shanghai Jiao Tong University
Yuxiang Zeng, Beihang University
Zhucun Xue, Zhejiang University
Qingdong He, Tencent Youtu Lab
Chengjie Wang, Tencent Youtu Lab and Shanghai Jiao Tong University
Yong Liu, Zhejiang University
Xiaobin Hu, Tencent Youtu Lab and Technische Universität München (TUM)
Shuicheng Yan, National University of Singapore