VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing

📅 2026-05-04

🤖 AI Summary

This work addresses a gap in evaluation: existing benchmarks do not effectively test large multimodal models on the multimodal reasoning and procedural operations that real-world video editing requires. The authors propose VEBench, which they describe as the first comprehensive benchmark covering both editing knowledge understanding and operational reasoning in realistic editing scenarios. The dataset comprises 3.9K edited videos (over 257 hours) and 3,080 human-verified question-answer pairs, built through a three-round human-AI collaborative annotation pipeline. It defines two tasks: video editing technique recognition, which asks models to identify 7 editing techniques from multimodal cues, and video editing operation simulation, which requires selecting relevant clips from multiple candidates and localizing them in time. Experiments on proprietary models (e.g., Gemini-2.5-Pro) and open-source models reveal a large gap between current systems and human-level performance in both editing knowledge comprehension and procedural reasoning, which points to directions for advancing intelligent video editing systems.

📝 Abstract

Real-world video editing demands not only expert knowledge of cinematic techniques but also multimodal reasoning to select, align, and combine footage into coherent narratives. While recent Large Multimodal Models (LMMs) have shown remarkable progress in general video understanding, their abilities in multi-video reasoning and operational editing workflows remain largely unexplored. We introduce VEBENCH, the first comprehensive benchmark designed to evaluate both editing knowledge understanding and operational reasoning in realistic video editing scenarios. VEBENCH contains 3.9K high-quality edited videos (over 257 hours) and 3,080 human-verified QA pairs, built through a three-round human-AI collaborative annotation pipeline that ensures precise temporal labeling and semantic consistency. It features two complementary QA tasks: 1) Video Editing Technique Recognition, assessing models' ability to identify 7 editing techniques using multimodal cues; and 2) Video Editing Operation Simulation, modeling real-world editing workflows by requiring the selection and temporal localization of relevant clips from multiple candidates. Extensive experiments across proprietary (e.g., Gemini-2.5-Pro) and open-source LMMs reveal a large gap between current model performance and human-level editing cognition. These results highlight the urgent need for bridging video understanding with creative operational reasoning. We envision VEBENCH as a foundation for advancing intelligent video editing systems and driving future research on complex reasoning.
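The operation simulation task pairs clip selection with temporal localization, so one natural way to score a single answer is to require the correct candidate clip and a sufficiently overlapping time span. The sketch below is illustrative only: the abstract does not state the benchmark's actual metrics, and the names `temporal_iou`, `score_item`, and the 0.5 threshold are assumptions made for this example, not the authors' protocol.

```python
from typing import Tuple

# (start_sec, end_sec); assumed well-formed, i.e. start <= end.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gold: Span) -> float:
    """Intersection-over-union of two time spans, in [0, 1]."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def score_item(
    pred_clip: int,
    pred_span: Span,
    gold_clip: int,
    gold_span: Span,
    iou_threshold: float = 0.5,  # hypothetical cutoff, not from the paper
) -> bool:
    """Count an answer as correct only if the chosen candidate clip matches
    and the predicted span overlaps the reference span enough."""
    same_clip = pred_clip == gold_clip
    return same_clip and temporal_iou(pred_span, gold_span) >= iou_threshold


if __name__ == "__main__":
    # Candidate clip 2 is the reference; the model picks it with a shifted span.
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
    print(score_item(2, (4.0, 14.0), 2, (5.0, 15.0)))
```

Scoring selection and localization jointly is stricter than scoring either alone; reporting the two components separately would be an equally reasonable design.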

Problem

Research questions and friction points this paper is trying to address.

video editing

large multimodal models

multimodal reasoning

editing benchmark

operational reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Multimodal Models

Video Editing Benchmark

Multimodal Reasoning