OpenVE-3M: A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing

📅 2025-12-08
🤖 AI Summary
Current instruction-guided video editing is hindered by two major bottlenecks: the scarcity of high-quality training data and the absence of standardized, human-aligned evaluation benchmarks. To address these challenges, we introduce OpenVE-3M, the first large-scale, multi-type dataset for instruction-based video editing, comprising both spatially aligned and non-spatially-aligned pairs. It features a fine-grained taxonomy covering eight distinct editing operations and is built with a high-precision automated generation pipeline coupled with rigorous quality filtering. Concurrently, we release OpenVE-Bench, the first unified evaluation benchmark for this task, whose metrics correlate strongly with human judgments. Leveraging OpenVE-3M, we train OpenVE-Edit, a 5B-parameter open-source model that achieves state-of-the-art performance on OpenVE-Bench, outperforming all existing open-source models, including a 14B-parameter baseline, while markedly improving editing accuracy and inference efficiency.

📝 Abstract
The quality and diversity of instruction-based image editing datasets are continuously increasing, yet large-scale, high-quality datasets for instruction-based video editing remain scarce. To address this gap, we introduce OpenVE-3M, an open-source, large-scale, and high-quality dataset for instruction-based video editing. It comprises two primary categories: spatially-aligned edits (Global Style, Background Change, Local Change, Local Remove, Local Add, and Subtitles Edit) and non-spatially-aligned edits (Camera Multi-Shot Edit and Creative Edit). All edit types are generated via a meticulously designed data pipeline with rigorous quality filtering. OpenVE-3M surpasses existing open-source datasets in terms of scale, diversity of edit types, instruction length, and overall quality. Furthermore, to address the lack of a unified benchmark in the field, we construct OpenVE-Bench, containing 431 video-edit pairs that cover a diverse range of editing tasks with three key metrics highly aligned with human judgment. We present OpenVE-Edit, a 5B model trained on our dataset that demonstrates remarkable efficiency and effectiveness by setting a new state-of-the-art on OpenVE-Bench, outperforming all prior open-source models including a 14B baseline. Project page is at https://github.com/lewandofskee/OpenVE.
Problem

Research questions and friction points this paper is trying to address.

Scarcity of large-scale, high-quality training data for instruction-guided video editing
Lack of a unified, human-aligned benchmark for evaluating video editing models
Inefficiency of existing models, which require large parameter counts for competitive editing quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

OpenVE-3M: a large-scale, open-source dataset for instruction-based video editing covering eight edit types, built with a quality-filtered generation pipeline
OpenVE-Bench: a unified benchmark of 431 video-edit pairs with three metrics highly aligned with human judgment
OpenVE-Edit: an efficient 5B model that outperforms all prior open-source models, including a 14B baseline
Haoyang He (Zhejiang University, ByteDance)
Jie Wang (ByteDance)
Jiangning Zhang (Zhejiang University)
Zhucun Xue (Zhejiang University)
Xingyuan Bu (ByteDance)
Qiangpeng Yang (ByteDance)
Shilei Wen (ByteDance)
Lei Xie (Zhejiang University)