STGV: Spatio-Temporal Hash Encoding for Gaussian-based Video Representation

📅 2026-04-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing Gaussian-based video representations struggle to disentangle static backgrounds from dynamic content, which leads to inaccurate spatiotemporal deformation modeling. To address this limitation, this work proposes a spatiotemporal hash encoding framework that introduces learnable 2D spatial and 3D temporal hash encodings into Gaussian video representation, enabling separate modeling of static and dynamic components. The method further incorporates a keyframe-guided strategy for initializing the canonical Gaussians, which improves geometric consistency and mitigates feature overlapping. Combined with 2D Gaussian splatting and a learnable deformation field, the proposed approach improves video reconstruction quality over existing Gaussian-based methods by 0.98 dB in PSNR and achieves competitive performance on downstream video tasks.
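To put the reported PSNR gain in context, the sketch below computes PSNR from the mean squared error and converts a dB gain into the implied ratio of MSE values. The function names `psnr` and `mse_ratio_for_gain` are hypothetical helpers for illustration, and the assumption that pixel values lie in [0, 1] is ours, not the paper's.

```python
import numpy as np

def psnr(reference, reconstruction, peak=1.0):
    """PSNR in dB between two arrays whose values lie in [0, peak] (assumed range)."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs
    return 10.0 * np.log10(peak ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Ratio new_MSE / old_MSE implied by a PSNR gain of gain_db decibels."""
    return 10.0 ** (-gain_db / 10.0)

# A 0.98 dB gain corresponds to an MSE ratio of about 0.80.
ratio = mse_ratio_for_gain(0.98)
```

Under this definition, a gain of 0.98 dB at a fixed peak value corresponds to roughly 20% lower mean squared error.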

📝 Abstract
2D Gaussian Splatting (2DGS) has recently become a promising paradigm for high-quality video representation. However, existing methods predict the deformations of canonical Gaussian primitives from embeddings that are either content-agnostic or have overlapping spatio-temporal features, which entangles the static and dynamic components of a video and prevents their distinct properties from being modeled effectively. This results in inaccurate predictions of spatio-temporal deformations and unsatisfactory representation quality. To address these problems, this paper proposes a Spatio-Temporal hash encoding framework for Gaussian-based Video representation (STGV). By decomposing video features into learnable 2D spatial and 3D temporal hash encodings, STGV facilitates the learning of motion patterns for dynamic components while maintaining background details for static elements. In addition, we construct a more stable and consistent initial canonical Gaussian representation through a key frame canonical initialization strategy, which prevents feature overlapping and structurally incoherent geometry. Experimental results demonstrate that our method attains better video representation quality (+0.98 dB PSNR) than other Gaussian-based methods and achieves competitive performance in downstream video tasks.
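The decomposition into a 2D spatial and a 3D temporal hash encoding can be illustrated with a minimal single-level sketch. This is not the authors' implementation: it assumes an Instant-NGP-style XOR hash with linear interpolation, takes the 3D input to be (x, y, t), and uses hypothetical names (`hash_encode`, `spatial_table`, `temporal_table`) and arbitrary table sizes. In a real system the tables would be learnable parameters, likely at multiple resolutions.

```python
import itertools
import numpy as np

# Per-axis multipliers for the XOR hash (values commonly used in Instant-NGP).
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_encode(coords, table, resolution):
    """Interpolated lookup of (N, D) coords in [0, 1] into a (T, F) feature table."""
    coords = np.asarray(coords, dtype=np.float64)
    n, d = coords.shape
    scaled = coords * (resolution - 1)
    # Lower grid corner of the cell containing each point.
    base = np.clip(np.floor(scaled).astype(np.int64), 0, resolution - 2)
    frac = scaled - base
    out = np.zeros((n, table.shape[1]))
    # Visit all 2^D cell corners (bilinear for D=2, trilinear for D=3).
    for corner in itertools.product((0, 1), repeat=d):
        offset = np.array(corner, dtype=np.int64)
        vertex = (base + offset).astype(np.uint64)
        h = np.zeros(n, dtype=np.uint64)
        for axis in range(d):
            h ^= vertex[:, axis] * PRIMES[axis]
        index = (h % np.uint64(table.shape[0])).astype(np.int64)
        weight = np.prod(np.where(offset == 1, frac, 1.0 - frac), axis=1)
        out += weight[:, None] * table[index]
    return out

rng = np.random.default_rng(0)
spatial_table = rng.normal(size=(2 ** 10, 4))   # static features, indexed by (x, y)
temporal_table = rng.normal(size=(2 ** 12, 4))  # dynamic features, indexed by (x, y, t)

xy = rng.uniform(size=(5, 2))                   # example Gaussian centers
t = np.full((5, 1), 0.25)                       # normalized timestamp
static_feat = hash_encode(xy, spatial_table, resolution=64)
dynamic_feat = hash_encode(np.hstack([xy, t]), temporal_table, resolution=32)
features = np.concatenate([static_feat, dynamic_feat], axis=1)  # shape (5, 8)
```

In this sketch the static features depend only on (x, y), so they are shared across all frames, while the dynamic features change with t. The concatenated vector stands in for the input a deformation network would receive.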
Problem

Research questions and friction points this paper is trying to address.

Gaussian-based video representation
spatio-temporal deformation
static-dynamic disentanglement
feature overlapping
canonical Gaussian representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatio-Temporal Hash Encoding
Gaussian Splatting
Video Representation
Canonical Initialization
Dynamic-Static Decomposition