Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

πŸ“… 2026-03-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing video editing methods are limited in instruction-following precision and reference-guided fidelity, primarily due to the scarcity of high-quality paired data. To address this, the authors propose a scalable data generation paradigm that transforms existing editing pairs into high-fidelity quadruplets using synthetic reference scaffolds, thereby constructing RefVIE, the first large-scale dataset for instruction- and reference-guided video editing. They further introduce Kiwi-Edit, a unified architecture that integrates learnable queries, latent visual features, and a multi-stage progressive training strategy. Experimental results demonstrate that the proposed approach significantly outperforms current state-of-the-art methods in both instruction adherence and reference fidelity, establishing a new benchmark for controllable video editing. The dataset, model, and code are publicly released to support future research.
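The core of the data paradigm is lifting an existing (source, instruction, edited) editing pair into a quadruplet by synthesizing a reference scaffold with an image generative model. The paper does not publish its data schema, so the field names, types, and the `synthesize_reference` callback below are illustrative assumptions, not the authors' implementation; this is only a minimal sketch of the transformation's shape:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EditingPair:
    """A conventional instruction-editing sample (hypothetical schema)."""
    source_video: str   # path to the original clip
    edited_video: str   # path to the edited result
    instruction: str    # natural-language edit instruction

@dataclass
class EditingQuadruplet:
    """Pair augmented with a synthesized reference image (hypothetical schema)."""
    source_video: str
    instruction: str
    reference_image: str  # synthesized reference scaffold
    edited_video: str

def pair_to_quadruplet(
    pair: EditingPair,
    synthesize_reference: Callable[[str, str], str],
) -> EditingQuadruplet:
    # The callback stands in for the paper's image generative model,
    # which produces a reference scaffold consistent with the edit.
    ref = synthesize_reference(pair.edited_video, pair.instruction)
    return EditingQuadruplet(
        source_video=pair.source_video,
        instruction=pair.instruction,
        reference_image=ref,
        edited_video=pair.edited_video,
    )
```

Applying this mapping over an existing editing corpus would yield the kind of large-scale quadruplet dataset the abstract describes, with the reference image supplying the visual detail that the text instruction alone cannot.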

πŸ“ Abstract
Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state-of-the-art in controllable video editing. All datasets, models, and code are released at https://github.com/showlab/Kiwi-Edit.
Problem

Research questions and friction points this paper is trying to address.

instruction-based video editing
reference-guided editing
visual control
training data scarcity
video editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction-based video editing
reference-guided editing
data generation pipeline
learnable queries
multi-stage training