🤖 AI Summary
This work addresses the challenge of representation inconsistency in 4D point cloud videos, caused by temporal scale variations under changing frame rates and by uncertainty in point distributions. To tackle these issues, the authors propose GATS, a dual-invariance framework that, for the first time, jointly models point cloud distribution uncertainty and temporal scale invariance. The method employs Uncertainty-Guided Gaussian Convolution (UGGC) to capture local geometric statistics and integrates a learnable Temporal Scaling Attention (TSA) module for adaptive temporal normalization. Evaluated on MSR-Action3D, NTU RGB+D, and Synthia4D, GATS achieves accuracy improvements of 6.62% and 1.4%, and a 1.8% mIoU gain, respectively, significantly outperforming existing Transformer-based approaches and demonstrating robustness and generalization under varying point densities, noise, occlusions, and frame-rate discrepancies.
📝 Abstract
Understanding 4D point cloud videos is essential for enabling intelligent agents to perceive dynamic environments. However, temporal scale bias across varying frame rates and distributional uncertainty in irregular point clouds make it highly challenging to design a unified and robust 4D backbone. Existing CNN- or Transformer-based methods are constrained either by limited receptive fields or by quadratic computational complexity, while neglecting these implicit distortions. To address this problem, we propose a novel dual-invariance framework, termed \textbf{Gaussian Aware Temporal Scaling (GATS)}, which explicitly resolves both distributional inconsistencies and temporal scale bias. The proposed \emph{Uncertainty-Guided Gaussian Convolution (UGGC)} incorporates local Gaussian statistics and uncertainty-aware gating into point convolution, thereby achieving robust neighborhood aggregation under density variation, noise, and occlusion. In parallel, the \emph{Temporal Scaling Attention (TSA)} introduces a learnable scaling factor to normalize temporal distances, ensuring frame-partition invariance and consistent velocity estimation across different frame rates. These two modules are complementary: temporal scaling normalizes time intervals prior to Gaussian estimation, while Gaussian modeling enhances robustness to irregular distributions. Our experiments on the mainstream benchmarks MSR-Action3D (\textbf{+6.62\%} accuracy), NTU RGB+D (\textbf{+1.4\%} accuracy), and Synthia4D (\textbf{+1.8\%} mIoU) demonstrate significant performance gains, offering a more efficient and principled paradigm for invariant 4D point cloud video understanding with superior accuracy, robustness, and scalability compared to Transformer-based counterparts.
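The temporal normalization idea behind TSA can be illustrated with a minimal NumPy sketch. In the paper, the scaling factor is learned inside an attention module; here, as an assumption for illustration only, the hypothetical function `temporal_weights` and the scale `tau` stand in for that mechanism, showing why dividing temporal distances by a factor proportional to the frame interval yields identical attention weights regardless of frame rate:

```python
import numpy as np

def temporal_weights(timestamps, anchor_t, tau):
    # Normalize temporal distances by a scaling factor tau,
    # then convert them into softmax-style attention weights.
    # (Illustrative stand-in for TSA's learned scaling; not the paper's code.)
    d = np.abs(timestamps - anchor_t) / tau
    e = np.exp(-d)
    return e / e.sum()

# The same motion sampled at 30 fps and 60 fps: four consecutive frames.
t30 = np.arange(4) * (1 / 30)
t60 = np.arange(4) * (1 / 60)

# Choosing tau proportional to the frame interval makes the weights
# identical across frame rates, i.e. frame-partition invariance.
w30 = temporal_weights(t30, t30[0], tau=1 / 30)
w60 = temporal_weights(t60, t60[0], tau=1 / 60)
print(np.allclose(w30, w60))  # True
```

Without the scaling (e.g. `tau = 1` for both), the 60 fps clip would concentrate more weight on nearby frames than the 30 fps clip, which is exactly the temporal scale bias the abstract describes.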