🤖 AI Summary
Existing Gaussian-based video representations struggle to disentangle static backgrounds from dynamic content, which leads to inaccurate spatiotemporal deformation modeling. To address this limitation, this work proposes a spatiotemporal hash encoding framework that introduces learnable 2D spatial and 3D temporal hash encodings into Gaussian video representation for the first time, so that static and dynamic components are modeled separately. The method also uses a keyframe-guided strategy to initialize the canonical Gaussians, which improves geometric consistency and reduces feature overlap. Combined with 2D Gaussian splatting and a learnable deformation field, the approach improves video reconstruction quality by +0.98 dB PSNR over existing Gaussian-based methods and achieves competitive performance on downstream video tasks.
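To put the reported gain in context, the following minimal sketch computes PSNR and converts a dB gain into the equivalent ratio of mean squared errors. It assumes a single frame pair with pixel values normalized to a peak of 1.0; the function names `psnr` and `mse_ratio_for_gain` are illustrative and do not come from the paper, and a gain in dataset-averaged PSNR only approximately corresponds to this per-frame ratio.

```python
import numpy as np


def psnr(reference, reconstruction, peak=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape."""
    mse = np.mean((reference - reconstruction) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Ratio new_mse / old_mse implied by a PSNR gain of gain_db (same peak)."""
    return 10.0 ** (-gain_db / 10.0)


# A +0.98 dB gain corresponds to roughly a 20% reduction in MSE.
ratio = mse_ratio_for_gain(0.98)
```

Under these assumptions, `ratio` is about 0.80, so a +0.98 dB improvement means the reconstruction error (MSE) drops by about one fifth.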
📝 Abstract
2D Gaussian Splatting (2DGS) has recently become a promising paradigm for high-quality video representation. However, existing methods predict the deformations of canonical Gaussian primitives from embeddings that are either content-agnostic or mix overlapping spatial and temporal features, which entangles the static and dynamic components of a video and prevents their distinct properties from being modeled effectively. This results in inaccurate spatio-temporal deformation predictions and unsatisfactory representation quality. To address these problems, this paper proposes a Spatio-Temporal hash encoding framework for Gaussian-based Video representation (STGV). By decomposing video features into learnable 2D spatial and 3D temporal hash encodings, STGV facilitates learning the motion patterns of dynamic components while preserving the background details of static elements. In addition, we construct a more stable and consistent initial canonical Gaussian representation through a key-frame canonical initialization strategy, which prevents feature overlap and structurally incoherent geometry. Experimental results demonstrate that our method attains better video representation quality (+0.98 dB PSNR) than other Gaussian-based methods and achieves competitive performance in downstream video tasks.
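The split into a 2D spatial and a 3D temporal hash encoding can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a single-resolution hash grid with multilinear interpolation in the style of Instant-NGP, that the 3D encoding is indexed by (x, y, t), and that the two feature vectors are concatenated before being passed to a deformation network (omitted here). The class `HashGrid` and all sizes are hypothetical.

```python
import itertools

import numpy as np

# Hashing primes in the style of Instant-NGP (an assumption, not from the paper).
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)


class HashGrid:
    """Single-resolution hash-grid encoding over [0, 1]^dim (hypothetical helper)."""

    def __init__(self, dim, resolution, table_size, feat_dim, rng):
        self.dim = dim
        self.resolution = resolution  # grid points per axis, must be >= 2
        self.table_size = table_size
        # In training, this table would be a learnable parameter.
        self.table = rng.normal(0.0, 1e-2, size=(table_size, feat_dim))

    def _hash(self, corners):
        # corners: (N, dim) integer grid coordinates -> (N,) table indices
        h = np.zeros(corners.shape[0], dtype=np.uint64)
        for d in range(self.dim):
            h ^= corners[:, d].astype(np.uint64) * PRIMES[d]
        return (h % np.uint64(self.table_size)).astype(np.int64)

    def __call__(self, coords):
        # coords: (N, dim) with values in [0, 1]
        x = np.clip(coords, 0.0, 1.0) * (self.resolution - 1)
        x0 = np.minimum(np.floor(x).astype(np.int64), self.resolution - 2)
        frac = x - x0
        out = np.zeros((coords.shape[0], self.table.shape[1]))
        # Blend the features of the 2^dim surrounding grid corners.
        for offset in itertools.product((0, 1), repeat=self.dim):
            offset = np.array(offset)
            w = np.prod(np.where(offset == 1, frac, 1.0 - frac), axis=1)
            out += w[:, None] * self.table[self._hash(x0 + offset)]
        return out


rng = np.random.default_rng(0)
spatial = HashGrid(dim=2, resolution=64, table_size=2**14, feat_dim=4, rng=rng)
temporal = HashGrid(dim=3, resolution=32, table_size=2**14, feat_dim=4, rng=rng)

xy = rng.random((5, 2))      # normalized centers of canonical Gaussians
t = np.full((5, 1), 0.25)    # normalized frame timestamp

static_feat = spatial(xy)                                   # time-independent
dynamic_feat = temporal(np.concatenate([xy, t], axis=1))    # time-dependent
features = np.concatenate([static_feat, dynamic_feat], axis=1)
```

In this sketch the spatial features do not change with `t`, so they can only carry time-invariant (static) information, while the temporal features vary across frames and can carry motion cues. That is the intuition behind encoding the two components separately.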