🤖 AI Summary
This work addresses the limitations of existing video generation models, which rely on coarse-grained control signals such as text or keyboard inputs and thus struggle to support high-fidelity interaction in extended reality (XR) driven by users' actual movements. To overcome this, the authors propose a human-centric video world model that integrates 3D head pose and joint-level hand articulation as fine-grained control signals. Building upon a diffusion Transformer architecture, they develop a bidirectional video diffusion teacher model, which is subsequently distilled into a causal, real-time interactive system. Experimental results demonstrate that the proposed approach significantly improves task execution efficiency and enhances users' perceived precision in motion control, outperforming current state-of-the-art baselines.
📄 Abstract
Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses. For this purpose, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand-object interactions. We train a bidirectional video diffusion teacher model using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated-reality system with human subjects and demonstrate improved task performance as well as a significantly higher level of perceived control over the performed actions compared with relevant baselines.
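To make the conditioning idea concrete, below is a minimal sketch of how tracked head pose and joint-level hand poses could be injected into a DiT-style block via adaptive LayerNorm (adaLN) modulation. This is an illustrative assumption, not the paper's implementation: the module names (`PoseConditioner`, `AdaLNBlock`), the pose parameterization (7-dim head pose as translation plus quaternion, 2 hands x 21 joints x xyz), and the choice of adaLN over other conditioning strategies are all hypothetical.

```python
# Hypothetical sketch of pose-conditioned DiT-style modulation (not the paper's code).
import torch
import torch.nn as nn


class PoseConditioner(nn.Module):
    """Embeds head pose + hand joints into a per-frame conditioning vector."""

    def __init__(self, hidden: int = 512,
                 head_pose_dim: int = 7,          # assumed: xyz translation + quaternion
                 hand_dim: int = 2 * 21 * 3):     # assumed: 2 hands x 21 joints x xyz
        super().__init__()
        self.head_mlp = nn.Sequential(nn.Linear(head_pose_dim, hidden), nn.SiLU(),
                                      nn.Linear(hidden, hidden))
        self.hand_mlp = nn.Sequential(nn.Linear(hand_dim, hidden), nn.SiLU(),
                                      nn.Linear(hidden, hidden))

    def forward(self, head_pose: torch.Tensor, hand_pose: torch.Tensor) -> torch.Tensor:
        # head_pose: (B, T, 7), hand_pose: (B, T, 126) -> (B, T, hidden)
        return self.head_mlp(head_pose) + self.hand_mlp(hand_pose)


class AdaLNBlock(nn.Module):
    """One transformer block whose norms are modulated by the pose embedding."""

    def __init__(self, hidden: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))
        # Per-frame scale/shift/gate for the attention and MLP paths.
        self.to_mod = nn.Linear(hidden, 6 * hidden)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, N, hidden) video tokens of one frame; cond: (B, hidden) pose embedding
        s1, b1, g1, s2, b2, g2 = self.to_mod(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        return x + g2.unsqueeze(1) * self.mlp(h)


if __name__ == "__main__":
    B, T, N, H = 2, 4, 16, 512
    cond = PoseConditioner(H)(torch.randn(B, T, 7), torch.randn(B, T, 126))
    out = AdaLNBlock(H)(torch.randn(B, N, H), cond[:, 0])  # condition frame-0 tokens
    print(out.shape)  # torch.Size([2, 16, 512])
```

In a bidirectional teacher, such blocks would attend across all frames of a clip; distilling into a causal student would restrict attention to past frames so the system can react to live head and hand tracking in real time. The attention-masking detail is likewise an assumption inferred from the abstract.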