HoliGS: Holistic Gaussian Splatting for Embodied View Synthesis

📅 2025-06-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses free-viewpoint synthesis from monocular long-duration RGB video sequences of dynamic scenes. We propose the Deformable Gaussian Lattice framework, which explicitly decouples static background and dynamic foreground—first achieving such explicit separation in this setting. Our method introduces an invertible Gaussian deformation network coupled with a hierarchical deformation strategy to jointly model rigid motion, skeletal articulation, and non-rigid deformation. By integrating differentiable Gaussian rendering with neural flow modeling, the framework enables efficient and stable novel-view synthesis. Evaluated on multiple dynamic scene benchmarks, our approach significantly outperforms state-of-the-art methods in reconstruction quality, training speed, and rendering efficiency. Notably, it demonstrates strong robustness and scalability in large-scale, complex interaction scenarios. This work establishes a new paradigm for real-time dynamic environment reconstruction in embodied intelligence applications.

Technology Category

Application Category

📝 Abstract
We propose HoliGS, a novel deformable Gaussian splatting framework that addresses embodied view synthesis from long monocular RGB videos. Unlike prior 4D Gaussian splatting and dynamic NeRF pipelines, which struggle with training overhead in minute-long captures, our method leverages invertible Gaussian Splatting deformation networks to reconstruct large-scale, dynamic environments accurately. Specifically, we decompose each scene into a static background plus time-varying objects, each represented by learned Gaussian primitives undergoing global rigid transformations, skeleton-driven articulation, and subtle non-rigid deformations via an invertible neural flow. This hierarchical warping strategy enables robust free-viewpoint novel-view rendering from various embodied camera trajectories by attaching Gaussians to a complete canonical foreground shape (eg, egocentric or third-person follow), which may involve substantial viewpoint changes and interactions between multiple actors. Our experiments demonstrate that ourmethod~ achieves superior reconstruction quality on challenging datasets while significantly reducing both training and rendering time compared to state-of-the-art monocular deformable NeRFs. These results highlight a practical and scalable solution for EVS in real-world scenarios. The source code will be released.
Problem

Research questions and friction points this paper is trying to address.

Embodied view synthesis from long monocular RGB videos
Reconstructing large-scale dynamic environments accurately
Reducing training and rendering time for deformable scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Invertible Gaussian Splatting deformation networks
Hierarchical warping strategy for dynamic scenes
Static background plus time-varying objects decomposition
🔎 Similar Papers
No similar papers found.