🤖 AI Summary
This work addresses the limitations of existing video generation models, which rely on coarse-grained control signals such as text or keyboard inputs and thus struggle to support high-fidelity interaction in extended reality (XR) driven by users' actual movements. To overcome this, the authors propose a human-centric video world model that integrates 3D head pose and joint-level hand articulation as fine-grained control signals. Building upon a diffusion Transformer architecture, they develop a bidirectional video diffusion teacher model, which is subsequently distilled into a causal, real-time interactive system. Experimental results demonstrate that the proposed approach significantly improves task execution efficiency and enhances users' perceived precision in motion control, outperforming current state-of-the-art baselines.
📄 Abstract
Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses. For this purpose, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand-object interactions. We train a bidirectional video diffusion teacher model using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated-reality system with human subjects and demonstrate improved task performance as well as a significantly higher level of perceived control over the performed actions compared with relevant baselines.
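To make the conditioning idea concrete, below is a minimal sketch of how tracked head pose and joint-level hand poses could be injected into a DiT-style block via adaptive LayerNorm (adaLN) modulation. This is an illustrative assumption, not the paper's implementation: the module names (`PoseConditioner`, `AdaLNBlock`), the pose parameterization (7-dim head pose as translation plus quaternion, 2 hands x 21 joints x xyz), and the choice of adaLN over other conditioning strategies are all hypothetical.

```python
# Hypothetical sketch of pose-conditioned DiT-style modulation (not the paper's code).
import torch
import torch.nn as nn


class PoseConditioner(nn.Module):
    """Embeds head pose + hand joints into a per-frame conditioning vector."""

    def __init__(self, hidden: int = 512,
                 head_pose_dim: int = 7,          # assumed: xyz translation + quaternion
                 hand_dim: int = 2 * 21 * 3):     # assumed: 2 hands x 21 joints x xyz
        super().__init__()
        self.head_mlp = nn.Sequential(nn.Linear(head_pose_dim, hidden), nn.SiLU(),
                                      nn.Linear(hidden, hidden))
        self.hand_mlp = nn.Sequential(nn.Linear(hand_dim, hidden), nn.SiLU(),
                                      nn.Linear(hidden, hidden))

    def forward(self, head_pose: torch.Tensor, hand_pose: torch.Tensor) -> torch.Tensor:
        # head_pose: (B, T, 7), hand_pose: (B, T, 126) -> (B, T, hidden)
        return self.head_mlp(head_pose) + self.hand_mlp(hand_pose)


class AdaLNBlock(nn.Module):
    """One transformer block whose norms are modulated by the pose embedding."""

    def __init__(self, hidden: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))
        # Per-frame scale/shift/gate for the attention and MLP paths.
        self.to_mod = nn.Linear(hidden, 6 * hidden)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, N, hidden) video tokens of one frame; cond: (B, hidden) pose embedding
        s1, b1, g1, s2, b2, g2 = self.to_mod(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        return x + g2.unsqueeze(1) * self.mlp(h)


if __name__ == "__main__":
    B, T, N, H = 2, 4, 16, 512
    cond = PoseConditioner(H)(torch.randn(B, T, 7), torch.randn(B, T, 126))
    out = AdaLNBlock(H)(torch.randn(B, N, H), cond[:, 0])  # condition frame-0 tokens
    print(out.shape)  # torch.Size([2, 16, 512])
```

In a bidirectional teacher, such blocks would attend across all frames of a clip; distilling into a causal student would restrict attention to past frames so the system can react to live head and hand tracking in real time. The attention-masking detail is likewise an assumption inferred from the abstract.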