VEU-Bench: Towards Comprehensive Understanding of Video Editing

📅 2025-04-24
🤖 AI Summary
Existing video large language models (Vid-LLMs) lack standardized evaluation for Video Editing Understanding (VEU), a critical capability for comprehending editing techniques ranging from intra-frame features to inter-shot cuts and transitions. Method: This paper introduces VEU-Bench, the first fine-grained benchmark covering recognition, reasoning, and judging stages across 19 diverse tasks. It features an ontology-driven automated annotation pipeline and a multi-stage decoupled evaluation framework. Additionally, the authors release Oscars, a specialized Vid-LLM fine-tuned on VEU data. Contribution/Results: Experiments show that Oscars achieves a 28.3% absolute accuracy gain over open-source Vid-LLMs on VEU-Bench, matching GPT-4o's performance. Moreover, incorporating VEU data improves general video understanding performance by an average of 8.3%, demonstrating its transferability. VEU-Bench establishes a rigorous foundation for evaluating and advancing editing-aware video intelligence.

📝 Abstract
Widely shared videos on the internet are often edited. Recently, although Video Large Language Models (Vid-LLMs) have made great progress in general video understanding tasks, their capabilities in video editing understanding (VEU) tasks remain unexplored. To address this gap, in this paper, we introduce VEU-Bench (Video Editing Understanding Benchmark), a comprehensive benchmark that categorizes video editing components across various dimensions, from intra-frame features like shot size to inter-shot attributes such as cut types and transitions. Unlike previous video editing understanding benchmarks that focus mainly on editing element classification, VEU-Bench encompasses 19 fine-grained tasks across three stages: recognition, reasoning, and judging. To enhance the annotation of VEU automatically, we built an annotation pipeline integrated with an ontology-based knowledge base. Through extensive experiments with 11 state-of-the-art Vid-LLMs, our findings reveal that current Vid-LLMs face significant challenges in VEU tasks, with some performing worse than random choice. To alleviate this issue, we develop Oscars, a VEU expert model fine-tuned on the curated VEU-Bench dataset. It outperforms existing open-source Vid-LLMs on VEU-Bench by over 28.3% in accuracy and achieves performance comparable to commercial models like GPT-4o. We also demonstrate that incorporating VEU data significantly enhances the performance of Vid-LLMs on general video understanding benchmarks, with an average improvement of 8.3% across nine reasoning tasks.
Problem

Research questions and friction points this paper is trying to address.

Assessing the video editing understanding capabilities of Video Large Language Models
Creating a comprehensive benchmark for video editing understanding tasks
Improving Vid-LLM performance on both video editing and general understanding tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed VEU-Bench, a fine-grained benchmark for video editing understanding
Built an automated annotation pipeline backed by an ontology-based knowledge base
Created Oscars, an expert model that significantly outperforms open-source Vid-LLMs
👥 Authors
Bozheng Li — Opus AI Research, Brown University
Yongliang Wu — Southeast University
Yi Lu — Opus AI Research, University of Toronto
Jiashuo Yu — Shanghai AI Laboratory
Licheng Tang — Opus AI Research
Jiawang Cao — Opus AI Research
Wenqing Zhu — Opus AI Research
Yuyang Sun — Opus AI Research
Jay Wu — Opus AI Research
Wenbo Zhu — Opus AI Research