🤖 AI Summary
This paper addresses a longstanding challenge in real-time interactive world modeling: balancing long-term geometric consistency against computational efficiency (speed and memory). To this end, the authors propose WorldPlay, a streaming video diffusion framework built on three core innovations: a Dual Action Representation for robust control from keyboard and mouse inputs; a Reconstituted Context Memory that dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible; and Context Forcing, a memory-aware distillation method that aligns memory context between teacher and student to prevent error drift at real-time speeds. WorldPlay generates streaming 720p video at 24 FPS end to end, substantially suppresses long-term drift while preserving geometric fidelity, and outperforms state-of-the-art methods across diverse long-horizon scenarios. Notably, it is the first method to enable high-fidelity, low-latency, and generalizable online 3D world interaction, marking a significant step toward practical real-time 3D world modeling.
📝 Abstract
This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay builds on three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware models. By aligning the memory context between the teacher and student, it preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. The project page and online demo can be found at https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D, respectively.
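To make the memory mechanism concrete, here is an illustrative sketch (not the authors' implementation) of the idea behind a reconstituted context: mix the most recent frames with geometrically important long-past anchor frames, then reassign ("reframe") their temporal positions so distant frames are not attenuated by their true temporal distance. The function name, importance scores, and window sizes below are all hypothetical.

```python
# Hedged sketch of a "reconstituted context memory" as described in the
# abstract. All names, scores, and window sizes are illustrative assumptions.

def reconstitute_context(frames, importance, num_recent=4, num_anchor=2):
    """Select recent frames plus high-importance long-past anchors.

    frames:     list of frame ids in generation order
    importance: dict frame_id -> geometric importance score (assumed given,
                e.g. derived from pose/visibility overlap with the current view)
    """
    recent = frames[-num_recent:]
    past = frames[:-num_recent]
    # Keep the long-past frames judged most geometrically relevant.
    anchors = sorted(past, key=lambda f: importance.get(f, 0.0),
                     reverse=True)[:num_anchor]
    context = sorted(anchors) + recent
    # "Temporal reframing": give the selected frames compact, contiguous
    # positions so anchors stay accessible despite their temporal distance.
    reframed_positions = {f: i for i, f in enumerate(context)}
    return context, reframed_positions

frames = list(range(100))                  # 100 generated frames so far
importance = {3: 0.9, 42: 0.7, 57: 0.2}    # hypothetical geometric scores
ctx, pos = reconstitute_context(frames, importance)
print(ctx)   # anchors [3, 42] followed by the 4 most recent frames
print(pos)   # compact positions 0..5 over the selected context
```

In a real system the selected context frames would condition the streaming diffusion model; the point of the sketch is only the selection-plus-reindexing pattern that keeps long-past geometry within reach of a bounded context window.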