🤖 AI Summary
Existing digital twin techniques struggle to model physically grounded, dexterous human interactions with static 3D scenes. This paper introduces the first scene- and action-conditioned video diffusion model, which jointly synthesizes spatiotemporally coherent and physically plausible interaction videos from two inputs: a static 3D scene render and a first-person hand mesh sequence. Its contributions are threefold: (1) the first application of video diffusion models to action-driven dynamic scene generation; (2) a dual-conditioning mechanism integrating scene renderings and hand geometry to ensure spatial consistency and motion fidelity; and (3) the first large-scale egocentric human-object interaction video dataset, combining synthetic and real-world data. Experiments demonstrate significant improvements in physical plausibility and visual realism on fine-grained manipulation tasks, including grasping, opening, and relocating objects, achieving state-of-the-art performance.
📝 Abstract
Recent progress in 3D reconstruction has made it easy to create realistic digital twins of everyday environments. However, current digital twins remain largely static and are limited to navigation and view synthesis, without embodied interactivity. To bridge this gap, we introduce the Dexterous World Model (DWM), a scene- and action-conditioned video diffusion framework that models how dexterous human actions induce dynamic changes in static 3D scenes.
Given a static 3D scene rendering and an egocentric hand motion sequence, DWM generates temporally coherent videos depicting plausible human-scene interactions. Our approach conditions video generation on (1) static scene renderings following a specified camera trajectory to ensure spatial consistency, and (2) egocentric hand mesh renderings that encode both geometry and motion cues to model action-conditioned dynamics directly. To train DWM, we construct a hybrid interaction video dataset. Synthetic egocentric interactions provide fully aligned supervision for joint locomotion and manipulation learning, while fixed-camera real-world videos contribute diverse and realistic object dynamics.
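The dual conditioning described above can be illustrated with a minimal sketch. Note that the abstract does not specify how the two conditioning streams are fused with the noisy video latents; the channel-wise concatenation below, the tensor shapes, and the helper `build_denoiser_input` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical shapes: T frames at H x W resolution, 3-channel RGB renders.
T, H, W = 8, 64, 64

# Condition 1: static scene renderings along the specified camera trajectory.
scene_renders = np.random.rand(T, H, W, 3).astype(np.float32)
# Condition 2: egocentric hand mesh renderings (geometry + motion cues).
hand_renders = np.random.rand(T, H, W, 3).astype(np.float32)
# Noisy video frames being denoised by the diffusion model.
noisy_video = np.random.rand(T, H, W, 3).astype(np.float32)

def build_denoiser_input(noisy, scene, hand):
    """One plausible fusion: channel-wise concatenation of the noisy
    video with both conditioning streams, so the denoiser sees scene
    layout and hand action at every frame. (Assumed, not specified.)"""
    return np.concatenate([noisy, scene, hand], axis=-1)

x = build_denoiser_input(noisy_video, scene_renders, hand_renders)
print(x.shape)  # (8, 64, 64, 9)
```

Per-frame concatenation like this keeps the conditioning spatially aligned with the generated frames, which is one common way to enforce the spatial consistency the abstract describes.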
Experiments demonstrate that DWM enables realistic and physically plausible interactions, such as grasping, opening, and moving objects, while maintaining camera and scene consistency. This framework represents a first step toward video diffusion-based interactive digital twins and enables embodied simulation from egocentric actions.