🤖 AI Summary
This work addresses the scarcity of high-quality instruction-based video editing data by introducing InsEdit, a context-aware visual editing framework built upon HunyuanVideo-1.5. The approach pairs a visual editing architecture with a video data pipeline based on Mutual Context Attention (MCA), and integrates video diffusion model fine-tuning, cross-modal instruction alignment, and a multi-stage data synthesis strategy, enabling effective training with only about 100K video editing samples. Evaluated on newly curated video instruction editing benchmarks, the method achieves state-of-the-art performance among open-source methods. Because the training recipe also includes image editing data, the same model supports image editing without architectural modifications.
📝 Abstract
Instruction-based video editing is a natural way to control video content with text, but adapting a video generation model into an editor usually appears to be data-hungry. At the same time, high-quality video editing data remains scarce. In this paper, we show that a video generation backbone can become a strong video editor without large-scale video editing data. We present InsEdit, an instruction-based editing model built on HunyuanVideo-1.5. InsEdit combines a visual editing architecture with a video data pipeline based on Mutual Context Attention (MCA), which creates aligned video pairs where edits can begin in the middle of a clip rather than only from the first frame. With video editing data on the order of only 100K samples, InsEdit achieves state-of-the-art results among open-source methods on our video instruction editing benchmarks. In addition, because our training recipe also includes image editing data, the final model supports image editing without any modification.
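To make the phrase "aligned video pairs where edits can begin in the middle of a clip" concrete, the following is a minimal sketch of what such a training pair might look like as a data record. It is an illustration only, not InsEdit's pipeline: MCA itself is not shown, frames are stood in for by string placeholders, and the names `EditPair`, `make_pair`, `edit_start`, and `first_changed_frame` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class EditPair:
    """Hypothetical record for one aligned (source, target) video pair."""
    instruction: str
    source: List[str]   # placeholder frames; real data would hold image tensors
    target: List[str]   # frame-aligned with `source`
    edit_start: int     # index of the first edited frame


def make_pair(source: Sequence[str], edited: Sequence[str],
              edit_start: int, instruction: str) -> EditPair:
    """Build a pair whose target matches the source before `edit_start`
    and takes the edited frames from `edit_start` onward."""
    if len(source) != len(edited):
        raise ValueError("source and edited clips must have the same length")
    if not 0 <= edit_start < len(source):
        raise ValueError("edit_start must index a frame inside the clip")
    target = list(source[:edit_start]) + list(edited[edit_start:])
    return EditPair(instruction, list(source), target, edit_start)


def first_changed_frame(pair: EditPair) -> Optional[int]:
    """Return the index of the first frame where source and target differ."""
    for i, (s, t) in enumerate(zip(pair.source, pair.target)):
        if s != t:
            return i
    return None


source = ["s0", "s1", "s2", "s3"]
edited = ["e0", "e1", "e2", "e3"]
pair = make_pair(source, edited, edit_start=2, instruction="make the car red")
```

Under these assumptions, an edit that only ever starts at the first frame is the special case `edit_start=0`. A mid-clip start leaves the leading frames of the target identical to the source.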