AI Summary
Existing text-to-video methods struggle to achieve fine-grained control over pose, composition, and motion across extended temporal durations, resulting in long videos that lack structural coherence and user controllability. This work proposes DrawVideo, a novel framework that enables controllable long-form video generation guided by storyboard sketches for the first time. The approach decomposes a video into individual shots, each jointly defined by a black-and-white sketch, an appearance prompt, and a motion prompt, and employs a hierarchical synthesis strategy: global multi-shot planning with local single-sketch rendering. Key contributions include a structure-aligned keyframe mechanism, the introduction of SketchLongVideo (the first sketch-guided long video dataset), and an integrated pipeline combining sketch-based control, prompt decomposition, keyframe extrapolation, and interpolation. Experiments demonstrate that DrawVideo significantly outperforms existing methods in structural controllability, appearance consistency, visual stability, and narrative coherence.
Abstract
Long video generation requires high-fidelity synthesis, coherent narrative structure, and user control over extended time spans. Existing text-to-video methods often rely on a single long prompt, limiting control over pose, composition, layout, and motion. We propose DrawVideo, a sketch-guided, storyboard-driven framework for controllable long-video generation. DrawVideo decomposes long videos into independently controllable shots, each defined by a black-and-white sketch, an appearance prompt, and a motion prompt. The sketch controls pose and layout, the appearance prompt defines identity, scene, and style, and the motion prompt guides temporal dynamics. DrawVideo follows a hierarchical 'global multi-shot, local single-sketch' strategy: it first generates a structure-aligned reference keyframe, then expands the motion prompt into derivative keyframes representing action states, and finally synthesizes clips between adjacent keyframes to build each shot. We also introduce SketchLongVideo, the first dataset for sketch-guided text-to-long-video generation, constructed from animation videos via shot detection, keyframe extraction, vision-language recognition, prompt decomposition, and sketch conversion. Experiments show that DrawVideo achieves strong structural controllability, appearance consistency, visual stability, and coherent long-video generation.
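The shot decomposition and the "global multi-shot, local single-sketch" ordering described above can be illustrated with a minimal planning sketch. It is not the authors' implementation: the names `Shot`, `split_motion`, `plan_shot`, and `plan_video` are hypothetical, no images or video are generated, and splitting the motion prompt on semicolons is only an assumed stand-in for however DrawVideo expands a motion prompt into action states.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Shot:
    """One storyboard shot, holding the three controls named in the abstract."""
    sketch_path: str        # black-and-white sketch: pose and layout
    appearance_prompt: str  # identity, scene, and style
    motion_prompt: str      # temporal dynamics


def split_motion(motion_prompt: str) -> List[str]:
    """Assumed stand-in for motion-prompt expansion: one action state per ';' phrase."""
    return [part.strip() for part in motion_prompt.split(";") if part.strip()]


def plan_shot(shot: Shot) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Plan a single shot as labels only (hypothetical helper).

    Step 1: one reference keyframe aligned to the sketch.
    Step 2: one derivative keyframe per action state in the motion prompt.
    Step 3: one clip between each pair of adjacent keyframes.
    """
    keyframes = [f"reference<{shot.sketch_path}>"]
    keyframes += [f"derived<{state}>" for state in split_motion(shot.motion_prompt)]
    clips = list(zip(keyframes, keyframes[1:]))
    return keyframes, clips


def plan_video(storyboard: List[Shot]) -> List[Tuple[List[str], List[Tuple[str, str]]]]:
    """Global level: every shot is planned independently, in storyboard order."""
    return [plan_shot(shot) for shot in storyboard]


if __name__ == "__main__":
    storyboard = [
        Shot("shot1.png", "a red fox in a snowy forest, watercolor style",
             "fox crouches; fox leaps; fox lands"),
        Shot("shot2.png", "the same fox beside a frozen river",
             "fox drinks"),
    ]
    for keyframes, clips in plan_video(storyboard):
        print(len(keyframes), "keyframes,", len(clips), "clips")
```

Under these assumptions a shot with N action states yields N + 1 keyframes and N clips, which makes explicit that the sketch constrains only the reference keyframe while the motion prompt determines how many clips a shot contains.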