AI Summary
This work addresses real-time novel view synthesis from sparse-view video streams. We propose the first history-aware, hybrid splat-voxel feed-forward reconstruction framework that requires no training on multi-view video data. Our method combines 3D Gaussian splatting with a hierarchical sparse voxel grid: a motion graph that lifts 2D tracking to 3D motion warps the Gaussians over time, and an error-aware sparse voxel Transformer fuses new observations, so temporal modeling happens at inference time using only historical frames. Compared to prior approaches, our method improves temporal consistency and visual fidelity, suppressing flickering and other artifacts, and achieves state-of-the-art performance in both static and streaming scene reconstruction. On a single H100 GPU, it runs at 15 FPS with a 350 ms delay.
Abstract
We study the problem of novel view streaming from sparse-view videos, which aims to generate a continuous sequence of high-quality, temporally consistent novel views as new input frames arrive. Existing novel view synthesis methods struggle with temporal coherence and visual fidelity, leading to flickering and inconsistency. To address these challenges, we introduce history-awareness, leveraging previous frames to reconstruct the scene and improve quality and stability. We propose a hybrid splat-voxel feed-forward scene reconstruction approach that combines Gaussian Splatting, which propagates information over time, with a hierarchical voxel grid for temporal fusion. Gaussian primitives are efficiently warped over time using a motion graph that extends 2D tracking models to 3D motion, while a sparse voxel transformer integrates new temporal observations in an error-aware manner. Crucially, our method does not require training on multi-view video datasets, which are currently limited in size and diversity, and can be directly applied to sparse-view video streams in a history-aware manner at inference time. Our approach achieves state-of-the-art performance in both static and streaming scene reconstruction, effectively reducing both temporal and visual artifacts while running at interactive rates (15 fps with a 350 ms delay) on a single H100 GPU. Project Page: https://19reborn.github.io/SplatVoxel/
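To make the idea of warping Gaussian primitives with a motion graph more concrete, here is a minimal sketch. It is not the authors' implementation: the function name `warp_gaussian_centers`, the inverse-distance weighting, and the choice of `k` nearest nodes are all assumptions made for illustration. The sketch assumes each graph node already carries a 3D translation (for example, lifted from a 2D tracker) and moves each Gaussian center by a blend of the motions of its nearest nodes.

```python
import math

def warp_gaussian_centers(centers, nodes, node_motion, k=2, eps=1e-8):
    """Warp 3D Gaussian centers with a sparse motion graph (illustrative only).

    centers:     list of (x, y, z) Gaussian centers at the previous frame
    nodes:       list of (x, y, z) graph node positions at the previous frame
    node_motion: list of (dx, dy, dz) per-node translations to the new frame
    k, eps:      hypothetical parameters, not taken from the paper
    """
    warped = []
    for c in centers:
        # Distance from this center to every graph node, paired with node index.
        dists = sorted((math.dist(c, n), i) for i, n in enumerate(nodes))
        nearest = dists[:k]
        # Inverse-distance weights: closer nodes influence the center more.
        weights = [1.0 / (d + eps) for d, _ in nearest]
        total = sum(weights)
        # Weighted average of the selected nodes' translations, per axis.
        offset = [
            sum(w * node_motion[i][axis] for w, (_, i) in zip(weights, nearest)) / total
            for axis in range(3)
        ]
        warped.append(tuple(c[axis] + offset[axis] for axis in range(3)))
    return warped

# Two nodes moving identically: every center moves by that same translation.
nodes = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
motion = [(0.5, 0.0, 0.0), (0.5, 0.0, 0.0)]
moved = warp_gaussian_centers([(0.3, 0.2, 0.1)], nodes, motion)
```

Only the centers are warped in this sketch; the other Gaussian attributes (rotation, scale, opacity, color) are left out, and any real system would also need the error-aware fusion step that the abstract assigns to the sparse voxel transformer.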