UniSplat: Unified Spatio-Temporal Fusion via 3D Latent Scaffolds for Dynamic Driving Scene Reconstruction

📅 2025-11-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses 3D reconstruction in autonomous driving, where sparse, non-overlapping camera views coexist with complex scene dynamics, by proposing a unified spatio-temporal fusion framework. The method introduces three key innovations: (1) a 3D latent scaffold, built on pretrained foundation models, that serves as a shared geometric and semantic prior for dynamic scene modeling; (2) a dual-branch Gaussian decoder that jointly performs point-anchored refinement and voxel-aware generation to represent dynamic objects explicitly; and (3) a persistent static-Gaussian memory that maintains cross-view and cross-frame consistency while enabling scene completion beyond the current camera coverage. Evaluated on real-world datasets, the approach substantially improves novel-view synthesis quality and sustains robust, high-fidelity rendering even beyond the original camera frustums, achieving state-of-the-art performance.
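The core of the pipeline is a 3D latent scaffold that is updated frame by frame. As a rough illustration of that temporal fusion step, the sketch below gates a previous scaffold (assumed to be already warped into the current ego frame) against the scaffold lifted from the current cameras; the module and tensor names are assumptions for illustration, not the paper's released code.

```python
# Minimal, illustrative sketch of temporal fusion on a 3D latent scaffold.
# All names and shapes here are assumptions; this is NOT the authors' implementation.
import torch
import torch.nn as nn


class LatentScaffoldFusion(nn.Module):
    """Keeps a voxel grid of latent features and fuses each new frame into it."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        # Gated (GRU-style) update so stale voxels can be overwritten by new evidence.
        self.update_gate = nn.Sequential(
            nn.Conv3d(2 * feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        self.candidate = nn.Sequential(
            nn.Conv3d(2 * feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, prev_scaffold: torch.Tensor, curr_scaffold: torch.Tensor) -> torch.Tensor:
        # prev_scaffold / curr_scaffold: (B, C, X, Y, Z) voxel features.
        # prev_scaffold is assumed to be warped into the current ego frame already.
        x = torch.cat([prev_scaffold, curr_scaffold], dim=1)
        z = self.update_gate(x)   # how much of the new frame to admit per voxel
        h = self.candidate(x)     # fused candidate features
        return (1.0 - z) * prev_scaffold + z * h


if __name__ == "__main__":
    fusion = LatentScaffoldFusion(feat_dim=64)
    prev = torch.randn(1, 64, 32, 32, 8)   # scaffold carried from the last frame
    curr = torch.randn(1, 64, 32, 32, 8)   # scaffold lifted from current cameras
    print(fusion(prev, curr).shape)        # torch.Size([1, 64, 32, 32, 8])
```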

📝 Abstract
Feed-forward 3D reconstruction for autonomous driving has advanced rapidly, yet existing methods struggle with the joint challenges of sparse, non-overlapping camera views and complex scene dynamics. We present UniSplat, a general feed-forward framework that learns robust dynamic scene reconstruction through unified latent spatio-temporal fusion. UniSplat constructs a 3D latent scaffold, a structured representation that captures geometric and semantic scene context by leveraging pretrained foundation models. To effectively integrate information across spatial views and temporal frames, we introduce an efficient fusion mechanism that operates directly within the 3D scaffold, enabling consistent spatio-temporal alignment. To ensure complete and detailed reconstructions, we design a dual-branch decoder that generates dynamic-aware Gaussians from the fused scaffold by combining point-anchored refinement with voxel-based generation, and maintain a persistent memory of static Gaussians to enable streaming scene completion beyond current camera coverage. Extensive experiments on real-world datasets demonstrate that UniSplat achieves state-of-the-art performance in novel view synthesis, while providing robust and high-quality renderings even for viewpoints outside the original camera coverage.
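To make the dual-branch decoder described above more concrete, the following sketch shows one plausible way to combine point-anchored refinement with voxel-based generation of Gaussian parameters. The parameter layout (position offset, scale, rotation, opacity, color) and the head sizes are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a dual-branch Gaussian decoder: one branch refines Gaussians
# anchored at lifted points, the other generates Gaussians from scaffold voxels.
import torch
import torch.nn as nn

GAUSS_DIM = 3 + 3 + 4 + 1 + 3  # xyz offset, scale, quaternion, opacity, rgb


class DualBranchGaussianDecoder(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        # Point-anchored branch: small residual corrections around known 3D points.
        self.point_head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(inplace=True), nn.Linear(128, GAUSS_DIM)
        )
        # Voxel branch: free Gaussians placed at occupied voxel centers to fill gaps.
        self.voxel_head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(inplace=True), nn.Linear(128, GAUSS_DIM)
        )

    def forward(self, point_feats, point_xyz, voxel_feats, voxel_xyz):
        # point_feats: (N, C) features at lifted points; point_xyz: (N, 3)
        # voxel_feats: (M, C) features of occupied voxels; voxel_xyz: (M, 3)
        params = torch.cat([self.point_head(point_feats), self.voxel_head(voxel_feats)], dim=0)
        anchors = torch.cat([point_xyz, voxel_xyz], dim=0)
        means = anchors + 0.05 * torch.tanh(params[:, :3])  # bounded position refinement
        scales = torch.exp(params[:, 3:6].clamp(max=4.0))
        quats = nn.functional.normalize(params[:, 6:10], dim=-1)
        opacity = torch.sigmoid(params[:, 10:11])
        rgb = torch.sigmoid(params[:, 11:14])
        return means, scales, quats, opacity, rgb


if __name__ == "__main__":
    dec = DualBranchGaussianDecoder()
    out = dec(torch.randn(100, 64), torch.randn(100, 3),
              torch.randn(50, 64), torch.randn(50, 3))
    print([t.shape for t in out])
```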
Problem

Research questions and friction points this paper is trying to address.

Reconstructing dynamic driving scenes from sparse, non-overlapping camera views
Fusing information consistently across spatial views and temporal frames
Completing reconstructions beyond the original camera coverage
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D latent scaffold for unified spatio-temporal fusion
Dual-branch decoder generates dynamic-aware Gaussians
Persistent static-Gaussian memory enables streaming scene completion (sketched below)
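The persistent-memory idea can be illustrated with a minimal buffer that keeps Gaussians judged static, stores them in a world frame, and re-expresses them in the current ego frame at query time. The static/dynamic mask, pose handling, and pruning policy below are illustrative assumptions, not the paper's design.

```python
# Minimal sketch of a persistent static-Gaussian memory for streaming completion.
import torch


class StaticGaussianMemory:
    def __init__(self, max_size: int = 200_000):
        self.max_size = max_size
        self.means = torch.empty(0, 3)
        self.feats = torch.empty(0, 14)  # scale, rotation, opacity, color, ...

    def update(self, means, feats, static_mask, world_from_ego):
        """Add this frame's static Gaussians (given in the ego frame) to memory."""
        keep = static_mask.bool()
        means_world = means[keep] @ world_from_ego[:3, :3].T + world_from_ego[:3, 3]
        # Naive FIFO pruning once the buffer exceeds max_size (an assumption).
        self.means = torch.cat([self.means, means_world], dim=0)[-self.max_size:]
        self.feats = torch.cat([self.feats, feats[keep]], dim=0)[-self.max_size:]

    def query(self, ego_from_world):
        """Return the memorized Gaussians in the current ego frame for rendering."""
        means_ego = self.means @ ego_from_world[:3, :3].T + ego_from_world[:3, 3]
        return means_ego, self.feats


if __name__ == "__main__":
    mem = StaticGaussianMemory()
    pose = torch.eye(4)
    mem.update(torch.randn(1000, 3), torch.randn(1000, 14),
               torch.rand(1000) > 0.2, pose)
    means, feats = mem.query(torch.inverse(pose))
    print(means.shape, feats.shape)
```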
Chen Shi
The Chinese University of Hong Kong, Shenzhen
Shaoshuai Shi
Didi Chuxing, Max Planck Institute for Informatics
Computer Vision · Deep Learning · Autonomous Driving
Xiaoyang Lyu
The University of Hong Kong; Zhejiang University
Computer Vision · Depth Estimation
Chunyang Liu
Didi Chuxing
Data Mining · Marketplace · Autonomous Driving
Kehua Sheng
Voyager Research, Didi Chuxing
Bo Zhang
Voyager Research, Didi Chuxing
Li Jiang
The Chinese University of Hong Kong, Shenzhen