Señorita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing end-to-end video editing methods suffer from a scarcity of high-quality paired training data, while inversion-based approaches, though training-free, exhibit slow inference, poor fine-grained controllability, and susceptibility to artifacts and temporal flickering. Method: We introduce Señorita-2M, a high-quality instruction-based dataset for general-purpose video editing, comprising roughly 2 million aligned original-edited video pairs. The dataset is built with four specialized video editing models, each crafted and trained by the authors to achieve state-of-the-art editing quality, and is cleaned by a filtering pipeline that eliminates poorly edited pairs; the authors also explore common video editing architectures to identify the most effective structure for current pre-trained generative models. Contribution/Results: Models trained on the dataset achieve state-of-the-art editing quality across multiple benchmarks, significantly suppressing flickering and artifacts, while running far faster at inference than inversion-based methods and supporting fine-grained, instruction-driven edits.

📝 Abstract
Recent advancements in video generation have spurred the development of video editing techniques, which can be divided into inversion-based and end-to-end methods. However, current video editing methods still face several challenges. Inversion-based methods, though training-free and flexible, are time-consuming during inference, struggle with fine-grained editing instructions, and produce artifacts and jitter. End-to-end methods, which rely on edited video pairs for training, offer faster inference but often produce poor editing results due to a lack of high-quality training video pairs. In this paper, to close this gap for end-to-end methods, we introduce Señorita-2M, a high-quality video editing dataset consisting of approximately 2 million video editing pairs. It is built with four high-quality, specialized video editing models, each crafted and trained by our team to achieve state-of-the-art editing results. We also propose a filtering pipeline to eliminate poorly edited video pairs. Furthermore, we explore common video editing architectures to identify the most effective structure for current pre-trained generative models. Extensive experiments show that our dataset helps yield remarkably high-quality video editing results. More details are available at https://senorita.github.io.
Problem

Research questions and friction points this paper is trying to address.

Enhance video editing quality
Address end-to-end method limitations
Provide high-quality training dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

High-quality video editing dataset
Specialized video editing models
Filtering pipeline for quality control
Bojia Zi
The Chinese University of Hong Kong
AGI
Penghui Ruan
The Hong Kong Polytechnic University
Text-to-Video Generation, Computer Vision
Marco Chen
Tsinghua University
Xianbiao Qi
Shenzhen Intellifusion Technologies Co., Ltd.
Neural Network Optimization, Generative Models, Large-Scale Pretrain Models, OCR
Shaozhe Hao
The University of Hong Kong
Shihao Zhao
The University of Hong Kong
Generative AI, Robust AI
Youze Huang
University of Electronic Science and Technology of China
Bin Liang
The Chinese University of Hong Kong
Rong Xiao
IntelliFusion Inc.
Kam-Fai Wong
The Chinese University of Hong Kong