VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

📅 2026-04-17

🤖 AI Summary

This work addresses the lack of large-scale human-annotated data and standardized evaluation criteria in video editing assessment, which makes it hard to measure instruction following, rendering quality, and edit exclusivity accurately. To this end, the authors introduce VEFX-Dataset, comprising 5,049 examples spanning 9 major categories and 32 subcategories of editing tasks, each annotated along these three decoupled quality dimensions. Building on this dataset, they develop VEFX-Reward, a specialized reward model that jointly models the source video, the editing instruction, and the edited video, and uses ordinal regression to score editing quality end to end. They also release VEFX-Bench, a standardized benchmark of 300 curated video-prompt pairs. Experiments show that VEFX-Reward aligns more closely with human judgments than general-purpose vision-language models and existing reward models. Used as an evaluator, it reveals persistent gaps in current systems across visual plausibility, instruction following, and edit locality.
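The summary does not spell out how the ordinal regression head works, so the following is only a minimal sketch of one common formulation (cumulative binary thresholds, in the style of CORAL), not the authors' implementation. The 5-level scale, the function names, and the example logits are all assumptions made for illustration.

```python
import math

# Dimension names follow the paper; everything else here is hypothetical.
DIMENSIONS = ("instruction_following", "rendering_quality", "edit_exclusivity")


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def ordinal_targets(label: int, num_levels: int) -> list[int]:
    """Encode a label in 1..num_levels as num_levels-1 binary targets.

    Target k answers the question "is the score greater than k?".
    """
    return [1 if label > k else 0 for k in range(1, num_levels)]


def decode_ordinal(threshold_logits: list[float]) -> tuple[int, float]:
    """Decode K-1 threshold logits into (hard level, expected score) on 1..K.

    threshold_logits[i] is the logit of P(score > i + 1).
    """
    probs = [sigmoid(z) for z in threshold_logits]
    hard_level = 1 + sum(p > 0.5 for p in probs)  # count thresholds passed
    expected_score = 1.0 + sum(probs)             # soft, continuous score
    return hard_level, expected_score


# Hypothetical logits for one edited video, one list per dimension
# (4 thresholds, i.e. an assumed 5-level scale).
example_logits = {
    "instruction_following": [3.0, 1.0, -2.0, -4.0],
    "rendering_quality": [4.0, 3.0, 2.0, -1.0],
    "edit_exclusivity": [2.0, -1.0, -3.0, -5.0],
}

scores = {dim: decode_ordinal(example_logits[dim]) for dim in DIMENSIONS}
```

Compared with plain regression, this encoding keeps the ordering of the quality levels without assuming that the distance between adjacent levels is constant, which is one plausible reason to prefer it for human rating scales.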

📝 Abstract

As AI-assisted video creation becomes increasingly practical, instruction-guided video editing has become essential for refining generated or captured footage to meet professional requirements. Yet the field still lacks both a large-scale human-annotated dataset with complete editing examples and a standardized evaluator for comparing editing systems. Existing resources are limited by small scale, missing edited outputs, or the absence of human quality labels, while current evaluation often relies on expensive manual inspection or generic vision-language model judges that are not specialized for editing quality. We introduce VEFX-Dataset, a human-annotated dataset containing 5,049 video editing examples across 9 major editing categories and 32 subcategories, each labeled along three decoupled dimensions: Instruction Following, Rendering Quality, and Edit Exclusivity. Building on VEFX-Dataset, we propose VEFX-Reward, a reward model designed specifically for video editing quality assessment. VEFX-Reward jointly processes the source video, the editing instruction, and the edited video, and predicts per-dimension quality scores via ordinal regression. We further release VEFX-Bench, a benchmark of 300 curated video-prompt pairs for standardized comparison of editing systems. Experiments show that VEFX-Reward aligns more strongly with human judgments than generic VLM judges and prior reward models on both standard IQA/VQA metrics and group-wise preference evaluation. Using VEFX-Reward as an evaluator, we benchmark representative commercial and open-source video editing systems, revealing a persistent gap between visual plausibility, instruction following, and edit locality in current models.
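The abstract does not name the "standard IQA/VQA metrics" it uses. Two that are commonly reported in quality-assessment work are the Pearson linear correlation (PLCC) and the Spearman rank correlation (SRCC) between model scores and human ratings. The sketch below shows how such agreement could be computed in plain Python; the choice of these two metrics, the function names, and the sample scores are assumptions for illustration, not results from the paper.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up ratings for five edited videos on a single quality dimension.
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
model_scores = [1.2, 1.9, 3.5, 3.4, 4.8]

plcc = pearson(human_scores, model_scores)
srcc = spearman(human_scores, model_scores)
```

SRCC only rewards getting the ordering of the videos right, while PLCC also rewards a linear relationship between predicted and human scores, so reporting both gives a fuller picture of how well an evaluator tracks human judgment.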

Problem

Research questions and friction points this paper is trying to address.

video editing

benchmark

human-annotated dataset

quality assessment

instruction-guided editing

Innovation

Methods, ideas, or system contributions that make the work stand out.

video editing benchmark

human-annotated dataset

reward model