🤖 AI Summary
This work investigates the potential of text-to-video models as interactive world simulators, focusing on generating semantically coherent and visually harmonious human–environment interaction videos from a single scene image and an action-oriented text prompt. Methodologically, it proposes an implicit affordance modeling framework that requires no bounding-box or pose annotations: cross-attention heatmap analysis uncovers the intrinsic perception of scene functionality already present in pre-trained video models, and the model is then fine-tuned for action-driven character insertion and behavior-consistent generation. The key contribution is the first empirical demonstration and exploitation of large video models' implicit encoding of scene affordances, enabling natural interaction modeling without explicit geometric supervision. Experiments show that the approach generates high-quality, behaviorally plausible, and visually consistent interaction videos across diverse complex scenes, strengthening the practical utility of text-to-video generation for embodied intelligence and virtual environment simulation.
📝 Abstract
Can a video generation model be repurposed as an interactive world simulator? We explore the affordance perception potential of text-to-video models by teaching them to predict human–environment interaction. Given a scene image and a prompt describing human actions, we fine-tune the model to insert a person into the scene while ensuring coherent behavior, consistent appearance, visual harmonization, and respect for scene affordances. Unlike prior work, we infer human affordance for video generation (i.e., where to insert a person and how they should behave) from a single scene image, without explicit conditions such as bounding boxes or body poses. An in-depth study of cross-attention heatmaps demonstrates that the inherent affordance perception of a pre-trained video model can be uncovered without labeled affordance datasets.