🤖 AI Summary
This paper addresses a longstanding challenge in real-time interactive world modeling: balancing long-term geometric consistency against computational efficiency (speed and memory). To this end, the authors propose WorldPlay, a streaming video diffusion framework built on three core innovations: a Dual Action Representation for robust control from keyboard and mouse inputs; a Reconstituted Context Memory that dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible; and Context Forcing, a memory-aware distillation method that aligns memory context between teacher and student to prevent error drift at real-time speeds. WorldPlay generates streaming 720p video at 24 FPS end to end, substantially suppresses long-term drift while preserving geometric fidelity, and outperforms state-of-the-art methods across diverse long-horizon scenarios. Notably, it is the first method to enable high-fidelity, low-latency, and generalizable online 3D world interaction, marking a significant step toward practical real-time 3D world modeling.
📝 Abstract
This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay builds on three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware models. By aligning the memory context between the teacher and student, it preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. The project page and online demo can be found at https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D, respectively.
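To make the memory mechanism concrete, here is an illustrative sketch (not the authors' implementation) of the idea behind a reconstituted context: mix the most recent frames with geometrically important long-past anchor frames, then reassign ("reframe") their temporal positions so distant frames are not attenuated by their true temporal distance. The function name, importance scores, and window sizes below are all hypothetical.

```python
# Hedged sketch of a "reconstituted context memory" as described in the
# abstract. All names, scores, and window sizes are illustrative assumptions.

def reconstitute_context(frames, importance, num_recent=4, num_anchor=2):
    """Select recent frames plus high-importance long-past anchors.

    frames:     list of frame ids in generation order
    importance: dict frame_id -> geometric importance score (assumed given,
                e.g. derived from pose/visibility overlap with the current view)
    """
    recent = frames[-num_recent:]
    past = frames[:-num_recent]
    # Keep the long-past frames judged most geometrically relevant.
    anchors = sorted(past, key=lambda f: importance.get(f, 0.0),
                     reverse=True)[:num_anchor]
    context = sorted(anchors) + recent
    # "Temporal reframing": give the selected frames compact, contiguous
    # positions so anchors stay accessible despite their temporal distance.
    reframed_positions = {f: i for i, f in enumerate(context)}
    return context, reframed_positions

frames = list(range(100))                  # 100 generated frames so far
importance = {3: 0.9, 42: 0.7, 57: 0.2}    # hypothetical geometric scores
ctx, pos = reconstitute_context(frames, importance)
print(ctx)   # anchors [3, 42] followed by the 4 most recent frames
print(pos)   # compact positions 0..5 over the selected context
```

In a real system the selected context frames would condition the streaming diffusion model; the point of the sketch is only the selection-plus-reindexing pattern that keeps long-past geometry within reach of a bounded context window.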