Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Instruction-driven video editing is held back by the scarcity of large-scale, high-quality training data, which limits the creative diversity and temporal coherence that trained models can achieve. To address this, we propose Ditto, the first framework to combine the creative diversity of an image editor with an in-context video generator through a novel data synthesis pipeline. An intelligent instruction agent generates and filters diverse editing instructions, enabling the construction of Ditto-1M, a million-scale, high-fidelity video editing dataset built with more than 12,000 GPU-days of compute. We further design a lightweight distilled architecture augmented with a temporal enhancement module, which improves temporal consistency while reducing computational cost. The resulting Editto model, trained on Ditto-1M, achieves state-of-the-art instruction-following performance. Our work establishes a scalable data paradigm and an efficient model architecture for instruction-based video editing.
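As a rough illustration of how such a synthesis pipeline could be wired together, the sketch below mirrors the stages named in the summary: instruction agent, keyframe image edit, in-context video generation, and quality filtering. The function names, signatures, and staging are assumptions made for illustration only; they are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class EditSample:
    source_video: str   # path to the source clip
    instruction: str    # natural-language editing instruction
    edited_video: str   # path to the generated, edited clip

def synthesize_pair(
    source_video: str,
    propose_instruction: Callable[[str], str],        # hypothetical: instruction agent
    edit_keyframe: Callable[[str, str], str],         # hypothetical: image editor on a keyframe
    generate_video: Callable[[str, str, str], str],   # hypothetical: in-context video generator
    passes_filter: Callable[[str, str, str], bool],   # hypothetical: agent-based quality filter
) -> Optional[EditSample]:
    """One pass of an assumed Ditto-style synthesis loop: propose an instruction,
    edit a keyframe, propagate the edit to the full clip, then keep the sample
    only if it survives quality filtering."""
    instruction = propose_instruction(source_video)
    edited_frame = edit_keyframe(source_video, instruction)
    edited_video = generate_video(source_video, edited_frame, instruction)
    if passes_filter(source_video, instruction, edited_video):
        return EditSample(source_video, instruction, edited_video)
    return None  # rejected samples are discarded to keep the dataset high-fidelity

def build_dataset(source_videos: List[str], **stages) -> List[EditSample]:
    samples = [synthesize_pair(v, **stages) for v in source_videos]
    return [s for s in samples if s is not None]
```

Because rejected samples are simply dropped in this sketch, such a pipeline trades raw throughput for dataset fidelity, which is consistent with the large GPU budget the authors report.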

📝 Abstract
Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.
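The abstract mentions a curriculum learning strategy but does not describe it. The minimal sketch below shows one common way such a schedule can be implemented, assuming a per-sample difficulty score; both the scoring function and the linear schedule are hypothetical and are not the paper's stated method.

```python
from typing import Callable, List, Sequence

def curriculum_schedule(
    samples: Sequence,                      # e.g. synthesized editing examples from Ditto-1M
    difficulty: Callable[[object], float],  # hypothetical scoring fn (e.g. edit complexity)
    epoch: int,
    total_epochs: int,
) -> List:
    """Minimal curriculum sketch: start from the easiest samples and
    gradually admit harder ones as training progresses."""
    ranked = sorted(samples, key=difficulty)
    # Fraction of the dataset visible at this epoch, growing linearly to 100%.
    fraction = min(1.0, (epoch + 1) / max(1, total_epochs))
    cutoff = max(1, int(len(ranked) * fraction))
    return ranked[:cutoff]
```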
Problem

Research questions and friction points this paper is trying to address.

Addressing scarcity of high-quality training data for video editing
Overcoming prohibitive cost-quality trade-off in video generation
Achieving scalable production of diverse instruction-video pairs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel data generation pipeline fuses image editor with video generator
Efficient distilled model architecture augmented by a temporal enhancer (see the illustrative sketch after this list)
Intelligent agent crafts diverse instructions and filters output quality
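For the temporal enhancer item above, this summary does not specify the module's design. The sketch below assumes it is a temporal self-attention block applied over per-frame features; the module name, tensor layout, and residual wiring are illustrative assumptions, not the paper's verified architecture.

```python
import torch
import torch.nn as nn

class TemporalEnhancer(nn.Module):
    """Illustrative temporal enhancement block (an assumption, not the paper's
    design): self-attention along the time axis so that per-frame features can
    exchange information across frames."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        # channels must be divisible by num_heads for multi-head attention.
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, height*width, channels) -- spatial tokens per frame
        b, t, s, c = x.shape
        # Attend over time independently at each spatial location.
        x_t = x.permute(0, 2, 1, 3).reshape(b * s, t, c)
        h = self.norm(x_t)
        attended, _ = self.attn(h, h, h)
        x_t = x_t + attended  # residual connection keeps per-frame content
        return x_t.reshape(b, s, t, c).permute(0, 2, 1, 3)

if __name__ == "__main__":
    enhancer = TemporalEnhancer(channels=64, num_heads=8)
    frames = torch.randn(2, 16, 24 * 24, 64)   # 2 clips, 16 frames, 24x24 latent grid
    print(enhancer(frames).shape)              # torch.Size([2, 16, 576, 64])
```

A block like this can be inserted into a distilled video backbone at comparatively low cost, since attention runs only along the time axis rather than over all spatio-temporal tokens.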