AI Summary
Existing video editing methods struggle to achieve both high quality and fidelity to user intent in few-step generation, and often rely on time-consuming iterative optimization. This work proposes a training-free, streaming video editing framework built on a pretrained streaming generative model. It combines dual-branch few-step sampling, a self-attention bridge, cross-attention grounding and boosting, source-oriented guidance, and a visual prompting strategy to move beyond the limitations of the conventional "data-to-data" paradigm. Across diverse editing tasks, the method outperforms state-of-the-art techniques, producing high-quality results efficiently in few-step settings and generalizing across different models.
Abstract
Although existing video editing methods are generally feasible, they often require many costly iterations and still struggle to deliver high-quality, satisfying editing results. We attribute this limitation to the prevalent data-to-data paradigm, which is less compatible with modern generative models than noise-to-data generation. To address this gap, we revisit video editing from a noise-to-data perspective and propose Streaming-Generation-based Video Editing (StreamGVE), which preserves few-step sampling while seamlessly injecting source-video conditions. Built on pre-trained streaming generation models, StreamGVE introduces dual-branch fast sampling with a self-attention bridge and cross-attention grounding/boosting to satisfy both sampling and conditioning requirements. We further propose source-oriented guidance to improve target-generation quality, and a visual prompting strategy to enhance editing flexibility and practicality. The method is effective, robust, and generalizable across different models. Extensive experiments on diverse video editing tasks show that StreamGVE consistently outperforms existing approaches, even in few-step settings with minimal time cost.