🤖 AI Summary
Existing large-scale video diffusion models generate diverse short clips but face three key bottlenecks in long-form, cinematic video generation: unrealistic human-scene interactions, inconsistent subject identity, and expensive training. This paper proposes GenHSI, a training-free, three-stage framework (script writing → pre-visualization → animation) that synthesizes long, identity-consistent videos with plausible human-scene interaction from a single scene image, a user description, and multiple images of a person. The core contributions are: (1) what the authors describe as the first training-free method to generate a long video with a consistent camera pose and an arbitrary number of character actions; and (2) a 3D-aware controllable pipeline built on task decomposition, 3D keyframe (storyboard) creation from single-view images without scanned scenes, and animation of the rendered keyframes by off-the-shelf video diffusion models. Experiments indicate that the method preserves scene content and character identity while producing plausible human-scene interaction.
📝 Abstract
Large-scale pre-trained video diffusion models have exhibited remarkable capabilities in diverse video generation. However, existing solutions face several challenges when using these models to generate long, movie-like videos with rich human-object interactions: unrealistic human-scene interaction, a lack of subject identity preservation, and the need for expensive training. We propose GenHSI, a training-free method for controllable generation of long human-scene interaction (HSI) videos. Taking inspiration from movie animation, our key insight is to overcome the limitations of previous work by subdividing the long video generation task into three stages: (1) script writing, (2) pre-visualization, and (3) animation. Given an image of a scene, a user description, and multiple images of a person, we use these three stages to generate long videos that preserve human identity and provide rich human-scene interactions. Script writing converts complex human tasks into simple atomic tasks, which the pre-visualization stage uses to generate 3D keyframes (storyboards). These 3D keyframes are rendered and then animated by off-the-shelf video diffusion models, yielding consistent long videos with rich contacts in a 3D-aware manner. A key advantage of our work is that we alleviate the need for accurate scanned scenes and create 3D keyframes from single-view images. We are the first to generate a long video sequence with a consistent camera pose that contains an arbitrary number of character actions without training. Experiments demonstrate that our method can generate long videos that effectively preserve scene content and character identity, with plausible human-scene interaction, from a single scene image. Visit our project homepage https://kunkun0w0.github.io/project/GenHSI/ for more information.
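
The three-stage decomposition described in the abstract can be pictured as a simple data flow. The Python sketch below is only an illustration of that flow and is not the GenHSI implementation: every name in it (`AtomicTask`, `Keyframe`, `Clip`, `write_script`, `previsualize`, `animate`, `plan_video`) is hypothetical, the script-writing stage is replaced by a naive text split, the keyframes and clips are empty placeholders rather than 3D content or video, and producing exactly one clip per keyframe is an assumption made for simplicity.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class AtomicTask:
    """One simple action, e.g. 'sit down' (hypothetical structure)."""
    action: str


@dataclass
class Keyframe:
    """Placeholder for a 3D keyframe (storyboard panel) of one atomic task."""
    index: int
    task: AtomicTask


@dataclass
class Clip:
    """Placeholder for a video clip animated toward one keyframe."""
    target: Keyframe


def write_script(description: str) -> list[AtomicTask]:
    # Stage 1 stand-in: split the description on commas, periods, and the
    # word "then". This naive split is an assumption for illustration only.
    parts = re.split(r"\s*(?:,|\.|\bthen\b)\s*", description)
    return [AtomicTask(p.strip()) for p in parts if p.strip()]


def previsualize(tasks: list[AtomicTask]) -> list[Keyframe]:
    # Stage 2 stand-in: one placeholder keyframe per atomic task, in order.
    return [Keyframe(i, task) for i, task in enumerate(tasks)]


def animate(keyframes: list[Keyframe]) -> list[Clip]:
    # Stage 3 stand-in: one placeholder clip per keyframe (an assumption).
    return [Clip(kf) for kf in keyframes]


def plan_video(description: str) -> list[Clip]:
    # Chain the three stages: script writing -> pre-visualization -> animation.
    return animate(previsualize(write_script(description)))


if __name__ == "__main__":
    for clip in plan_video("walk to the sofa, then sit down, then pick up the book"):
        print(clip.target.index, clip.target.task.action)
```

Because each stage only consumes the previous stage's output, the number of clips grows with the number of atomic tasks in the description. This mirrors, in toy form, the abstract's point that decomposition allows an arbitrary number of character actions.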