SketchVideo: Sketch-based Video Generation and Editing

📅 2025-03-30
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the lack of fine-grained, joint spatial and temporal control in video generation and editing. We propose the first controllable video generation and editing framework guided by temporally sparse hand-drawn sketches, given as single-frame or two-frame inputs. Methodologically, we design sketch control blocks that predict residual features for skipped DiT blocks, together with an inter-frame attention mechanism that efficiently propagates sketch guidance across the entire video sequence. Additionally, we introduce a video insertion module coupled with latent fusion to ensure geometric accuracy and motion coherence within edited regions while preserving high fidelity in unedited areas. Built upon the DiT architecture, our approach achieves state-of-the-art performance on multiple controllable video generation and editing benchmarks, delivering simultaneous improvements in spatial precision, motion consistency, and interactive usability.
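As a rough illustration, here is a minimal PyTorch sketch of the residual conditioning idea, assuming a ControlNet-style zero-initialized output projection and a hypothetical `control_every` spacing between controlled backbone blocks; the names, shapes, and block layout are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class SketchControlBlock(nn.Module):
    """Hypothetical control block: attends from video tokens to sketch
    tokens and predicts a residual for the output of a skipped frozen
    DiT block, so only a subset of blocks needs trainable counterparts."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.zero_out = nn.Linear(dim, dim)  # zero-init, ControlNet-style
        nn.init.zeros_(self.zero_out.weight)
        nn.init.zeros_(self.zero_out.bias)

    def forward(self, hidden: torch.Tensor, sketch: torch.Tensor) -> torch.Tensor:
        # hidden: (B, N, C) video tokens; sketch: (B, M, C) sketch tokens
        out, _ = self.attn(self.norm(hidden), sketch, sketch)
        return self.zero_out(out)  # residual contribution, starts at zero


def run_dit_with_control(blocks, control_blocks, x, sketch, control_every=2):
    """Run frozen DiT blocks; every `control_every`-th block receives a
    residual from a control block. Intermediate blocks are skipped by the
    control branch, which is what makes the structure memory-efficient."""
    for i, block in enumerate(blocks):
        x = block(x)
        if i % control_every == 0:
            x = x + control_blocks[i // control_every](x, sketch)
    return x
```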

๐Ÿ“ Abstract
Video generation and editing conditioned on text prompts or images have undergone significant advancements. However, challenges remain in accurately controlling global layout and geometry details solely by text, and in supporting motion control and local modification through images. In this paper, we aim to achieve sketch-based spatial and motion control for video generation and support fine-grained editing of real or synthetic videos. Based on the DiT video generation model, we propose a memory-efficient control structure with sketch control blocks that predict residual features of skipped DiT blocks. Sketches are drawn on one or two keyframes (at arbitrary time points) for easy interaction. To propagate such temporally sparse sketch conditions across all frames, we propose an inter-frame attention mechanism to analyze the relationship between the keyframes and each video frame. For sketch-based video editing, we design an additional video insertion module that maintains consistency between the newly edited content and the original video's spatial features and dynamic motion. During inference, we use latent fusion for the accurate preservation of unedited regions. Extensive experiments demonstrate that our SketchVideo achieves superior performance in controllable video generation and editing.
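The latent fusion mentioned in the abstract can be pictured as a mask-based blend in latent space at each denoising step. Below is a minimal sketch, assuming a binary edit mask and a caller-supplied `noise_to_t` helper that re-noises the original video's latents to the current timestep; the paper's exact fusion schedule is not specified here and may differ.

```python
import torch

@torch.no_grad()
def latent_fusion(x_t: torch.Tensor,
                  original_latent: torch.Tensor,
                  edit_mask: torch.Tensor,
                  noise_to_t) -> torch.Tensor:
    """Keep generated content inside the edit mask and overwrite the
    outside with the original latent re-noised to the current step, so
    unedited regions are preserved up to VAE reconstruction error.

    x_t:             (B, T, C, H, W) current denoised latent video
    original_latent: (B, T, C, H, W) latents of the input video
    edit_mask:       (B, T, 1, H, W) in [0, 1], 1 inside the edited region
    noise_to_t:      callable mapping clean latents to the current noise level
    """
    return edit_mask * x_t + (1.0 - edit_mask) * noise_to_t(original_latent)
```

Applying such a blend after every sampling step, rather than once at the end, keeps the generated content consistent with the frozen surroundings throughout denoising.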
Problem

Research questions and friction points this paper is trying to address.

Achieve sketch-based spatial and motion control for video generation
Support fine-grained editing of real or synthetic videos
Control global layout and geometry details via sketches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Memory-efficient control structure with sketch control blocks
Inter-frame attention for sparse sketch propagation (see the sketch after this list)
Video insertion module for consistent editing
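A minimal sketch of how inter-frame attention might propagate sparse keyframe guidance, assuming each frame's tokens query the sketched keyframes' tokens through standard cross-attention; the dimensions and residual wiring are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class InterFrameAttention(nn.Module):
    """Hypothetical inter-frame attention: every frame's tokens query the
    tokens of the (one or two) sketched keyframes, propagating temporally
    sparse sketch guidance to all frames via learned correspondences."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_tokens: torch.Tensor,
                keyframe_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens:    (B*T, N, C) tokens of each video frame
        # keyframe_tokens: (B*T, K*N, C) keyframe tokens, broadcast per frame
        q = self.norm_q(frame_tokens)
        kv = self.norm_kv(keyframe_tokens)
        out, _ = self.attn(q, kv, kv)
        return frame_tokens + out  # residual connection


# Toy usage: 16 frames, 2 sketched keyframes, 256 tokens of width 512 each.
B, T, K, N, C = 1, 16, 2, 256, 512
frames = torch.randn(B * T, N, C)
keys = torch.randn(B, K * N, C).repeat_interleave(T, dim=0)
out = InterFrameAttention(C)(frames, keys)  # (B*T, N, C)
```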
🔎 Similar Papers
No similar papers found.
Feng-Lin Liu
Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Hongbo Fu
Professor and Acting Head, Arts and Machine Creativity, HKUST
Computer Graphics · Human-Computer Interaction · Computer Vision
Xintao Wang
Kuaishou Technology
Weicai Ye
Kling Team, Kuaishou Technology
Multimodal Generative Foundation Models · World Model · 3D Vision · Embodied AI · AGI
Pengfei Wan
Head of Kling Video Generation Models, Kuaishou Technology
Generative Models · Computer Vision · Multimodal AI · Computer Graphics
Di Zhang
Kuaishou Technology
Lin Gao
Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences