🤖 AI Summary
This work addresses the lack of effective benchmarks for evaluating how well existing large models handle multimodal reasoning and operational workflows in real-world video editing. To bridge this gap, the authors propose the first benchmark that covers two dimensions of authentic video editing tasks: cognitive understanding and operational simulation. They construct a high-quality dataset comprising 3.9K videos and 3,080 question-answer pairs, developed through three rounds of human-AI collaborative annotation and enriched with multimodal cue analysis, temporal localization, and multi-candidate clip selection. Two core tasks, video editing technique recognition and editing operation simulation, are designed to assess model capabilities. Experiments on mainstream models, including Gemini-2.5-Pro, reveal a large performance gap between current systems and human experts in both editing knowledge comprehension and procedural reasoning, thereby delineating critical directions for advancing intelligent video editing systems.
📝 Abstract
Real-world video editing demands not only expert knowledge of cinematic techniques but also multimodal reasoning to select, align, and combine footage into coherent narratives. While recent Large Multimodal Models (LMMs) have shown remarkable progress in general video understanding, their abilities in multi-video reasoning and operational editing workflows remain largely unexplored. We introduce VEBENCH, the first comprehensive benchmark designed to evaluate both editing knowledge understanding and operational reasoning in realistic video editing scenarios. VEBENCH contains 3.9K high-quality edited videos (over 257 hours) and 3,080 human-verified QA pairs, built through a three-round human-AI collaborative annotation pipeline that ensures precise temporal labeling and semantic consistency. It features two complementary QA tasks: 1) Video Editing Technique Recognition, assessing models' ability to identify 7 editing techniques using multimodal cues; and 2) Video Editing Operation Simulation, modeling real-world editing workflows by requiring the selection and temporal localization of relevant clips from multiple candidates. Extensive experiments across proprietary (e.g., Gemini-2.5-Pro) and open-source LMMs reveal a large gap between current model performance and human-level editing cognition. These results highlight the urgent need for bridging video understanding with creative operational reasoning. We envision VEBENCH as a foundation for advancing intelligent video editing systems and driving future research on complex reasoning.
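To make the second task concrete, the sketch below shows one way a single Video Editing Operation Simulation answer could be checked: the model must pick the right clip among several candidates and localize the relevant span within it. The abstract does not state how localization is scored, so the use of temporal intersection-over-union (a common measure in temporal grounding work), the 0.5 threshold, and the names `temporal_iou` and `score_item` are all illustrative assumptions rather than the benchmark's actual protocol.

```python
def temporal_iou(pred, gold):
    """Intersection-over-union of two (start, end) intervals, in seconds."""
    intersection = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - intersection
    return intersection / union if union > 0 else 0.0


def score_item(pred_clip, pred_span, gold_clip, gold_span, iou_threshold=0.5):
    """Hypothetical scoring rule (an assumption, not taken from the paper):
    an answer counts as correct only if the selected candidate clip matches
    the reference and the predicted span overlaps the reference span enough.
    """
    if pred_clip != gold_clip:
        return False  # wrong candidate selected; localization is moot
    return temporal_iou(pred_span, gold_span) >= iou_threshold


# Example: candidate "B" is the reference clip, with reference span 2.0-10.0 s.
correct = score_item("B", (2.0, 8.0), "B", (2.0, 10.0))
wrong_clip = score_item("A", (2.0, 10.0), "B", (2.0, 10.0))
```

Requiring both conditions reflects the abstract's description of the task as combining selection with temporal localization; a benchmark could equally report the two components separately.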