TGSFormer: Scalable Temporal Gaussian Splatting for Embodied Semantic Scene Completion

📅 2025-11-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing Gaussian-primitive methods for 3D semantic scene completion suffer from redundant initialization and scale poorly to unbounded scenes. Depth-guided approaches improve local modeling, but they remain tied to frame-based buffers and image consistency, which limits temporal scalability. This paper proposes TGSFormer, an embodied temporal Gaussian scene completion framework. Its core innovations are: (1) a persistent Gaussian memory that removes the reliance on frame buffers and inter-frame image alignment; and (2) a dual temporal encoder coupled with a confidence-aware voxel fusion module, which together align historical and current Gaussian features, merge redundant primitives, and regulate primitive density. By combining temporal Gaussian splatting, confidence-aware cross-attention, and depth-guided initialization, the method achieves state-of-the-art performance on both local and embodied semantic completion benchmarks, with higher accuracy, a smaller memory footprint, and better long-term scene completeness using fewer primitives.

📝 Abstract

Embodied 3D Semantic Scene Completion (SSC) infers dense geometry and semantics from continuous egocentric observations. Most existing Gaussian-based methods rely on random initialization of many primitives within predefined spatial bounds, resulting in redundancy and poor scalability to unbounded scenes. A recent depth-guided approach alleviates this issue but remains local, suffering from latency and memory overhead as scale increases. To overcome these challenges, we propose TGSFormer, a scalable Temporal Gaussian Splatting framework for embodied SSC. It maintains a persistent Gaussian memory for temporal prediction, without relying on image coherence or frame caches. For temporal fusion, a Dual Temporal Encoder jointly processes current and historical Gaussian features through confidence-aware cross-attention. Subsequently, a Confidence-aware Voxel Fusion module merges overlapping primitives into voxel-aligned representations, regulating density and maintaining compactness. Extensive experiments demonstrate that TGSFormer achieves state-of-the-art results on both local and embodied SSC benchmarks, offering superior accuracy and scalability with significantly fewer primitives while maintaining consistent long-term scene integrity. The code will be released upon acceptance.
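
The abstract does not specify how confidence enters the cross-attention, so the following is only an illustrative sketch of one plausible reading: current Gaussian features act as queries, historical (memory) features act as keys and values, and each historical primitive's confidence is added as a log-bias to its attention logit. The function name `confidence_cross_attention`, the log-bias formulation, and the single-head, projection-free layout are all assumptions made here for illustration, not the authors' design.

```python
import math


def confidence_cross_attention(queries, keys, values, confidences, eps=1e-6):
    """Single-head cross-attention with a log-confidence bias (hypothetical sketch).

    queries:     list of current-frame feature vectors (each of length d)
    keys/values: lists of historical feature vectors (same count as confidences)
    confidences: per-historical-primitive scores, expected in (0, 1]
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    for q in queries:
        # Scaled dot-product logit plus log-confidence, so low-confidence
        # history is down-weighted multiplicatively after the softmax.
        logits = [
            scale * sum(qi * ki for qi, ki in zip(q, k)) + math.log(max(c, eps))
            for k, c in zip(keys, confidences)
        ]
        # Numerically stable softmax over the historical primitives.
        peak = max(logits)
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Confidence-weighted mix of the historical value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

With a zero query vector every dot product vanishes, so under this assumed formulation the attention weights reduce to the normalized confidences; that makes the effect of the bias easy to check by hand.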

Problem

Research questions and friction points this paper is trying to address.

Scalable 3D semantic scene completion from continuous views

Reducing redundancy and memory in unbounded scene reconstruction

Maintaining long-term scene consistency with temporal Gaussian fusion

Innovation

Methods, ideas, or system contributions that make the work stand out.

Persistent Gaussian memory enables temporal prediction without frame caches

Dual Temporal Encoder uses confidence-aware cross-attention for fusion

Confidence-aware Voxel Fusion merges primitives into compact voxel representations
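
The voxel fusion idea in the last item can be illustrated with a minimal sketch. It assumes, without confirmation from the paper, that primitives whose centers fall in the same voxel are merged by a confidence-weighted average and that the merged primitive keeps the highest confidence in its group. The dictionary keys (`mean`, `feat`, `conf`), the function name `fuse_by_voxel`, and the `voxel_size` default are hypothetical.

```python
from collections import defaultdict
from math import floor


def fuse_by_voxel(gaussians, voxel_size=0.5):
    """Merge Gaussians sharing a voxel by confidence-weighted averaging (sketch).

    Each Gaussian is a dict with hypothetical keys:
      "mean": (x, y, z) center, "feat": list of floats, "conf": float > 0
    """
    # Group primitives by the integer index of the voxel containing their center.
    buckets = defaultdict(list)
    for g in gaussians:
        key = tuple(floor(c / voxel_size) for c in g["mean"])
        buckets[key].append(g)

    fused = []
    for key, group in buckets.items():
        total = sum(g["conf"] for g in group)
        # Confidence-weighted center and feature of the merged primitive.
        mean = tuple(
            sum(g["conf"] * g["mean"][i] for g in group) / total for i in range(3)
        )
        dim = len(group[0]["feat"])
        feat = [sum(g["conf"] * g["feat"][j] for g in group) / total for j in range(dim)]
        # Assumption: the merged primitive inherits the group's highest confidence.
        fused.append(
            {"voxel": key, "mean": mean, "feat": feat, "conf": max(g["conf"] for g in group)}
        )
    return fused
```

Because at most one primitive survives per occupied voxel, the primitive count under this scheme is bounded by the number of occupied voxels rather than by the number of frames observed, which is one way the density regulation described above could be realized.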

🔎 Similar Papers

Periodic Vibration Gaussian: Dynamic Urban Scene Reconstruction and Real-time Rendering