🤖 AI Summary
This work addresses a limitation of existing egocentric hand–object interaction world models: they condition video generation on future object states, so they cannot serve as realistic simulators for embodied AI. The authors propose EgoHOI, an egocentric world model that generates photorealistic, contact-consistent hand–object interaction sequences from action signals alone. Its key component is a set of physics-informed embeddings that distill geometric and kinematic priors from 3D estimates and regularize the rollouts toward physically valid dynamics, without access to future-state information. Experiments on the HOT3D dataset show consistent gains over strong baselines, and ablation studies support the effectiveness of the physics-informed design.
📝 Abstract
To serve as a scalable data source for embodied AI, world models should act as true simulators that infer interaction dynamics strictly from user actions, rather than mere conditional video generators relying on privileged future object states. In this context, egocentric Human-Object Interaction (HOI) world models are critical for predicting physically grounded first-person rollouts. However, building such models is profoundly challenging due to rapid head motions, severe occlusions, and high-DoF hand articulations that abruptly alter contact topologies. Consequently, existing approaches often circumvent these physics challenges by resorting to conditional video generation with access to known future object trajectories. We introduce EgoHOI, an egocentric HOI world model that breaks away from this shortcut to simulate photorealistic, contact-consistent interactions from action signals alone. To ensure physical accuracy without future-state inputs, EgoHOI distills geometric and kinematic priors from 3D estimates into physics-informed embeddings. These embeddings regularize the egocentric rollouts toward physically valid dynamics. Experiments on the HOT3D dataset demonstrate consistent gains over strong baselines, and ablations validate the effectiveness of our physics-informed design.