🤖 AI Summary
This work addresses the challenge of generating interactive 4D (3D + time) scenes from a single static image and text prompts, enabling real-time, user-driven visual exploration. We propose a lightweight, web-native, editable 4D world model comprising four core components: multimodal input fusion, 4D scene generation, interactive editing, and foveated rendering guided by gaze estimation. By integrating WebGL with Supersplat for efficient rendering, the framework combines 3D video generation with eye-movement-aware rendering to achieve low-latency, high-fidelity 4D dynamic visualization directly in the browser. Experiments demonstrate significant improvements in temporal coherence, editing responsiveness, and perceptual immersion. To our knowledge, this is the first end-to-end approach that constructs fully interactive 4D environments from a single image and text prompt. The implementation, including source code and an online demo, is publicly released.
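The core idea behind gaze-guided foveated rendering is to spend rendering budget where the user is actually looking: content near the estimated gaze point is drawn at full detail, while peripheral content falls back to coarser levels. A minimal sketch of that detail-selection logic is below; the function and parameter names (`foveationLevel`, `radii`) are hypothetical and do not come from the released implementation.

```javascript
// Illustrative sketch (not the authors' implementation): pick a level of
// detail for a screen-space element (e.g. a splat) based on its distance
// from the estimated gaze point. Coordinates are normalized to [0, 1].
function foveationLevel(x, y, gazeX, gazeY, radii = [0.1, 0.25]) {
  // radii[0]: foveal ring (full detail), radii[1]: parafoveal ring (medium).
  const d = Math.hypot(x - gazeX, y - gazeY);
  if (d <= radii[0]) return 0; // full detail
  if (d <= radii[1]) return 1; // medium detail
  return 2;                    // coarse periphery
}

// Example: gaze fixed at screen center (0.5, 0.5).
console.log(foveationLevel(0.52, 0.5, 0.5, 0.5)); // near gaze -> 0
console.log(foveationLevel(0.9, 0.9, 0.5, 0.5));  // periphery -> 2
```

In a real renderer this level would drive how many splats are drawn or at what resolution each screen region is shaded, re-evaluated every frame as the gaze estimate updates.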
📝 Abstract
We introduce 3D4D, an interactive 4D visualization framework that integrates WebGL with Supersplat rendering. It transforms static images and text into coherent 4D scenes through four core modules and employs a foveated rendering strategy for efficient, real-time multimodal interaction. The framework enables adaptive, user-driven exploration of complex 4D environments. The project page and code are available at https://yunhonghe1021.github.io/NOVA/.