🤖 AI Summary
This work investigates the potential of text-to-video models as interactive world simulators, focusing on generating semantically coherent and visually harmonious human–environment interaction videos from a single scene image and an action-oriented text prompt. Methodologically, it proposes an implicit affordance modeling framework that requires no bounding-box or pose annotations: cross-attention heatmap analysis uncovers the intrinsic perception of scene functionality already present in pre-trained video models, and the model is then fine-tuned for action-driven character insertion and behavior-consistent generation. The key contribution is the first empirical demonstration and exploitation of large video models' implicit encoding of scene affordances, enabling natural interaction modeling without explicit geometric supervision. Experiments show that the approach generates high-quality, behaviorally plausible, and visually consistent interaction videos across diverse complex scenes, strengthening the practical utility of text-to-video generation for embodied intelligence and virtual environment simulation.
📝 Abstract
Can a video generation model be repurposed as an interactive world simulator? We explore the affordance perception potential of text-to-video models by teaching them to predict human–environment interaction. Given a scene image and a prompt describing human actions, we fine-tune the model to insert a person into the scene while ensuring coherent behavior, consistent appearance, visual harmonization, and respect for scene affordances. Unlike prior work, we infer human affordance for video generation (i.e., where to insert a person and how they should behave) from a single scene image, without explicit conditions such as bounding boxes or body poses. An in-depth study of cross-attention heatmaps demonstrates that the inherent affordance perception of a pre-trained video model can be uncovered without labeled affordance datasets.