MiVE: Multiscale Vision-language features for reference-guided video Editing

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses modality misalignment and the loss of spatial detail in reference-guided video editing by introducing MiVE, a framework that preserves the original video's motion and unedited content while following both the text instruction and the reference image. MiVE is the first to systematically leverage multiscale hierarchical features from a vision-language model (Qwen3-VL), combining fine-grained spatial details from early layers with high-level semantic information from deeper layers. Built on a unified self-attention Diffusion Transformer, MiVE drops cross-attention to avoid modality mismatch. In human preference evaluations, MiVE ranks highest, outperforming both existing academic methods and leading commercial systems while preserving semantic meaning and structural fidelity.
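The idea of drawing features from several depths of one encoder can be illustrated with a toy sketch. Everything here is an assumption for illustration: the tiny `toy_layer` stack stands in for Qwen3-VL, the tapped depths in `tap_indices` are hypothetical, and channel-wise concatenation in `fuse_multiscale` is just one plausible fusion rule, since the summary does not specify how MiVE combines the layers.

```python
import math
import random


def toy_layer(tokens, weights):
    """One toy 'layer': a linear map followed by tanh, applied per token."""
    return [
        [math.tanh(sum(w * x for w, x in zip(row, tok))) for row in weights]
        for tok in tokens
    ]


def run_with_taps(tokens, layers, tap_indices):
    """Run every layer in order and keep hidden states at the requested depths."""
    taps = {}
    hidden = tokens
    for depth, weights in enumerate(layers):
        hidden = toy_layer(hidden, weights)
        if depth in tap_indices:
            taps[depth] = hidden
    return taps


def fuse_multiscale(taps, tap_indices):
    """Concatenate tapped features along the channel axis, token by token.

    Concatenation is an assumed fusion rule, not the paper's confirmed design.
    """
    num_tokens = len(taps[tap_indices[0]])
    return [
        [value for depth in tap_indices for value in taps[depth][t]]
        for t in range(num_tokens)
    ]


random.seed(0)
dim, num_layers, num_tokens = 4, 6, 3

# Random weights stand in for a pretrained vision-language model.
layers = [
    [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
    for _ in range(num_layers)
]
tokens = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(num_tokens)]

tap_indices = [0, 2, 5]  # hypothetical choice: an early, a middle, and the final layer
taps = run_with_taps(tokens, layers, tap_indices)
fused = fuse_multiscale(taps, tap_indices)

# Each token now carries features from all tapped depths.
print(len(fused), len(fused[0]))
```

A final-layer-only encoder would correspond to `tap_indices = [5]`; the extra taps are what give each token access to the shallower, more localized features.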

📝 Abstract

Reference-guided video editing takes a source video, a text instruction, and a reference image as inputs, requiring the model to faithfully apply the instructed edits while preserving original motion and unedited content. Existing methods fall into two paradigms, each with inherent limitations: decoupled encoders suffer from modality gaps when processing instructions and visual content independently, while unified vision-language encoders lose fine-grained spatial details by relying solely on final-layer representations. We observe that VLM layers encode complementary information hierarchically -- early layers capture localized spatial details essential for precise editing, while deeper layers encode global semantics for instruction comprehension. Building on this insight, we present MiVE (Multiscale Vision-language features for reference-guided video Editing), a framework that repurposes VLMs as multiscale feature extractors. MiVE extracts hierarchical features from Qwen3-VL and integrates them into a unified self-attention Diffusion Transformer, eliminating the modality mismatch inherent in cross-attention designs. Experiments demonstrate that MiVE achieves state-of-the-art performance by ranking highest in human preference, outperforming both academic methods and commercial systems.
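To make the contrast with cross-attention concrete, the following minimal sketch runs one self-attention pass over a single sequence that holds both video tokens and conditioning tokens. It is a simplification under stated assumptions, not MiVE's implementation: the token values are made up, the Q/K/V projections are replaced by the identity, and a real Diffusion Transformer would add learned projections, multiple heads, positional information, and timestep conditioning.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (for brevity)."""
    scale = math.sqrt(len(tokens[0]))
    outputs, weights = [], []
    for query in tokens:
        # Every token scores itself against every token in the joint sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        attn = softmax(scores)
        weights.append(attn)
        outputs.append([
            sum(a * value[c] for a, value in zip(attn, tokens))
            for c in range(len(query))
        ])
    return outputs, weights


# Made-up 4-dimensional tokens; real tokens would come from a video latent
# encoder and from the vision-language model's features.
video_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
condition_tokens = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]

# Unified design: one sequence, one attention operation, shared by all tokens.
joint = video_tokens + condition_tokens
outputs, weights = self_attention(joint)

# Only the video positions would be decoded back into frames.
video_out = outputs[:len(video_tokens)]
```

In a cross-attention design, by comparison, video tokens act only as queries while conditioning tokens supply keys and values through a separate layer; here both kinds of token pass through the same attention operation and can attend to each other in both directions.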

Problem

Research questions and friction points this paper is trying to address.

reference-guided video editing

vision-language models

modality gap

spatial details

multiscale features

Innovation

Methods, ideas, or system contributions that make the work stand out.

multiscale vision-language features

reference-guided video editing

diffusion transformer