🤖 AI Summary
Existing 4D human-scene interaction (HSI) methods rely heavily on costly paired 3D scene and motion-capture (MoCap) data, limiting scalability and generalization. This paper proposes ZeroHSI, a zero-shot 4D HSI generation framework that requires no ground-truth MoCap supervision. It transfers motion priors from large-scale video generation models to HSI, integrating neural human rendering with differentiable optimization to synthesize realistic, semantics-driven interactions. The method is driven by interaction prompts and operates across both static scenes and environments with dynamic objects, indoors and outdoors. Qualitative and quantitative evaluations on a curated multi-scene dataset demonstrate diverse, contextually appropriate human-scene interactions. The work establishes an efficient, scalable paradigm for 4D interactive content generation, with implications for embodied AI, VR, and robotics.
📝 Abstract
Human-scene interaction (HSI) generation is crucial for applications in embodied AI, virtual reality, and robotics. While existing methods can synthesize realistic human motions in 3D scenes and generate plausible human-object interactions, they heavily rely on datasets containing paired 3D scene and motion capture data, which are expensive and time-consuming to collect across diverse environments and interactions. We present ZeroHSI, a novel approach that enables zero-shot 4D human-scene interaction synthesis by integrating video generation and neural human rendering. Our key insight is to leverage the rich motion priors learned by state-of-the-art video generation models, which have been trained on vast amounts of natural human movements and interactions, and to use differentiable rendering to reconstruct human-scene interactions from the generated videos. ZeroHSI can synthesize realistic human motions in both static scenes and environments with dynamic objects, without requiring any ground-truth motion data. We evaluate ZeroHSI on a curated dataset of various indoor and outdoor scenes with different interaction prompts, demonstrating its ability to generate diverse and contextually appropriate human-scene interactions.
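To make the two-stage idea in the abstract concrete, here is a minimal sketch: sample a video of the desired interaction from a pretrained video generation model, then fit per-frame human pose parameters to it by gradient descent through a differentiable renderer. Everything below is a hypothetical stand-in, not ZeroHSI's actual implementation: `generate_video` and `render_human` are stubs (a real system would condition the video model on the rendered scene and prompt, and would pose a human body model inside the reconstructed 3D scene), and the pixel-space MSE loss is a simplification.

```python
# Conceptual sketch only: generate a target video from the motion prior,
# then recover per-frame human pose via differentiable rendering.
import torch
import torch.nn.functional as F

T, H, W = 16, 64, 64      # number of frames and (downscaled) resolution
POSE_DIM = 24 * 3         # e.g., axis-angle joints of a SMPL-like body

# Fixed random projection standing in for a real differentiable renderer.
PROJ = torch.randn(POSE_DIM, 3 * H * W)

def generate_video(prompt: str) -> torch.Tensor:
    """Stand-in for a pretrained video generation model conditioned on
    the scene and an interaction prompt. Returns noise so the sketch runs."""
    return torch.rand(T, 3, H, W)

def render_human(pose: torch.Tensor) -> torch.Tensor:
    """Stand-in differentiable renderer: maps (T, POSE_DIM) pose
    parameters to RGB frames. A real system would pose a human body
    model and rasterize it into the reconstructed 3D scene."""
    frames = torch.tanh(0.01 * pose @ PROJ)
    return frames.view(T, 3, H, W) * 0.5 + 0.5

# 1) Sample a video of the desired interaction from the video prior.
target = generate_video("a person sits down on the sofa")

# 2) Optimize pose parameters so rendered frames match the generated video.
pose = torch.zeros(T, POSE_DIM, requires_grad=True)
opt = torch.optim.Adam([pose], lr=1e-2)
for step in range(200):
    opt.zero_grad()
    loss = F.mse_loss(render_human(pose), target)
    loss.backward()
    opt.step()
```

In practice the reconstruction loss would likely be computed in a perceptual or feature space rather than raw pixels, and the optimization would also account for scene geometry and dynamic objects; the sketch only illustrates why no ground-truth MoCap is needed once the video prior and a differentiable renderer are available.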