Seeing World Dynamics in a Nutshell

📅 2025-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Monocular video-based dynamic 3D reconstruction is hampered by complex motion, occlusions, and spatiotemporal geometric inconsistency. Method: The paper proposes NutWorld, a framework built on a structured spatial-temporal aligned Gaussian (STAG) representation: a continuous, optimization-free spatiotemporal Gaussian flow. NutWorld generates dynamic 3D Gaussian scenes in a single forward pass of neural rendering, jointly regularized by depth and optical flow to enforce geometric consistency and motion coherence. Contribution/Results: It is among the first to model videos as parameterized, continuous spatiotemporal Gaussian primitives, eliminating frame-by-frame optimization. Experiments demonstrate state-of-the-art reconstruction quality, real-time inference, and significantly improved geometric and motion fidelity for downstream tasks, including novel-view synthesis and temporal editing, under strong motion and severe occlusion.
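To make the core idea concrete, here is a minimal, purely illustrative sketch of a "spatiotemporal Gaussian" primitive: each Gaussian carries a continuous trajectory for its center, so per-frame 3D Gaussians are obtained by simply evaluating that trajectory at a timestamp rather than optimizing per frame. All names are hypothetical, and the toy linear motion model stands in for whatever parameterization the paper actually uses.

```python
import numpy as np

class SpatiotemporalGaussian:
    """Toy primitive: a 3D Gaussian whose center moves continuously in time.

    The linear motion model here is an assumption for illustration only;
    the paper's actual STAG parameterization is not reproduced here.
    """

    def __init__(self, center, velocity, scale, opacity):
        self.center = np.asarray(center, dtype=float)      # position at t = 0
        self.velocity = np.asarray(velocity, dtype=float)  # linear motion coefficient
        self.scale = np.asarray(scale, dtype=float)        # per-axis extent
        self.opacity = float(opacity)

    def at(self, t):
        """Evaluate the continuous trajectory at time t in [0, 1]."""
        return self.center + t * self.velocity

def gaussians_at_time(primitives, t):
    """Produce the static Gaussian centers for one frame, no optimization loop.

    In the real method a network would predict the primitives' parameters in
    one forward pass and the resulting Gaussians would be splatted to an image.
    """
    return np.stack([g.at(t) for g in primitives])

prims = [
    SpatiotemporalGaussian(center=[0, 0, 1], velocity=[1, 0, 0], scale=[0.1] * 3, opacity=0.9),
    SpatiotemporalGaussian(center=[1, 1, 2], velocity=[0, -1, 0], scale=[0.2] * 3, opacity=0.8),
]
centers = gaussians_at_time(prims, 0.5)  # Gaussian centers halfway through the clip
print(centers)
```

Because time is a continuous input rather than a discrete frame index, the same primitives can be queried at any timestamp, which is what enables applications like temporal editing without re-optimizing each frame.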

📝 Abstract
We consider the problem of efficiently representing casually captured monocular videos in a spatially- and temporally-coherent manner. While existing approaches predominantly rely on 2D/2.5D techniques treating videos as collections of spatiotemporal pixels, they struggle with complex motions, occlusions, and geometric consistency due to the absence of temporal coherence and explicit 3D structure. Drawing inspiration from monocular video as a projection of the dynamic 3D world, we explore representing videos in their intrinsic 3D form through continuous flows of Gaussian primitives in space-time. In this paper, we propose NutWorld, a novel framework that efficiently transforms monocular videos into dynamic 3D Gaussian representations in a single forward pass. At its core, NutWorld introduces a structured spatial-temporal aligned Gaussian (STAG) representation, enabling optimization-free scene modeling with effective depth and flow regularization. Through comprehensive experiments, we demonstrate that NutWorld achieves high-fidelity video reconstruction quality while enabling various downstream applications in real-time. Demos and code will be available at https://github.com/Nut-World/NutWorld.
Problem

Research questions and friction points this paper is trying to address.

Efficient monocular video representation
Dynamic 3D Gaussian modeling
Real-time high-fidelity reconstruction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Monocular video to 3D
Dynamic Gaussian representations
Optimization-free scene modeling