🤖 AI Summary
Existing video editing benchmarks suffer from limited source video diversity, narrow task coverage, and incomplete evaluation metrics, hindering systematic assessment of instruction-guided video editing. To address this, we introduce IVEBench, a modern benchmark suite tailored to this task, comprising 600 high-quality source videos and 8 categories of editing tasks with 35 fine-grained subcategories, enabling evaluation of complex semantic understanding and multi-step instruction following. IVEBench establishes a three-dimensional evaluation protocol covering video quality, instruction compliance, and video fidelity, integrating traditional metrics with automated scoring from multimodal large language models and achieving strong alignment with human judgments. Editing prompts are generated by large language models and refined through expert review to ensure semantic accuracy and task feasibility. Extensive experiments demonstrate that IVEBench effectively discriminates among state-of-the-art methods, substantially improving the systematicity, reliability, and generalizability of video editing evaluation.
📝 Abstract
Instruction-guided video editing has emerged as a rapidly advancing research direction, offering new opportunities for intuitive content transformation while also posing significant challenges for systematic evaluation. Existing video editing benchmarks fail to adequately support the evaluation of instruction-guided video editing, and further suffer from limited source diversity, narrow task coverage, and incomplete evaluation metrics. To address the above limitations, we introduce IVEBench, a modern benchmark suite specifically designed for instruction-guided video editing assessment. IVEBench comprises a diverse database of 600 high-quality source videos spanning seven semantic dimensions and covering video lengths ranging from 32 to 1,024 frames. It further includes 8 categories of editing tasks with 35 subcategories, whose prompts are generated and refined through large language models and expert review. Crucially, IVEBench establishes a three-dimensional evaluation protocol encompassing video quality, instruction compliance, and video fidelity, integrating both traditional metrics and multimodal large language model-based assessments. Extensive experiments demonstrate the effectiveness of IVEBench in benchmarking state-of-the-art instruction-guided video editing methods, showing its ability to provide comprehensive and human-aligned evaluation outcomes.
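The three-dimensional protocol described above can be pictured as per-dimension scores combined into one benchmark result. The sketch below is a minimal, hypothetical illustration (the function name, weighting scheme, and score ranges are assumptions, not taken from the paper): each edited video gets a video-quality, instruction-compliance, and fidelity score in [0, 1], which are then averaged.

```python
# Hypothetical sketch of a three-dimensional score aggregation in the spirit
# of IVEBench. All names and the equal-weight choice are illustrative
# assumptions; the paper's actual aggregation may differ.

def aggregate_score(quality: float,
                    instruction_compliance: float,
                    fidelity: float,
                    weights: tuple = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Combine the three dimension scores (each assumed to lie in [0, 1])."""
    dims = (quality, instruction_compliance, fidelity)
    if not all(0.0 <= d <= 1.0 for d in dims):
        raise ValueError("dimension scores must lie in [0, 1]")
    # Weighted average over the three evaluation dimensions.
    return sum(w * d for w, d in zip(weights, dims))

# Example: a method that follows instructions well but loses some fidelity.
score = aggregate_score(quality=0.82, instruction_compliance=0.91, fidelity=0.77)
```

With equal weights this is just the mean of the three dimensions; a benchmark could instead weight instruction compliance more heavily if edit correctness is the primary concern.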