🤖 AI Summary
Existing methods struggle to model the geometric dynamics of 4D point cloud videos because they operate mainly in the spatiotemporal domain. This work proposes a unified framework, STS-Mixer, grounded in graph spectral signal processing, which treats 4D point cloud videos as multi-band graph signals. Using the graph Fourier transform, the approach decomposes the signal into low-frequency components, which correspond to coarse shape structures, and high-frequency components, which capture fine-grained geometric details. A hybrid architecture then jointly exploits spatial, temporal, and spectral-domain information. The authors report consistently superior performance over prior approaches on multiple benchmarks covering 3D action recognition and 4D semantic segmentation, which supports the value of a spectral perspective for understanding 4D geometric dynamics.
📝 Abstract
4D point cloud videos capture the rich spatial and temporal dynamics of scenes, which makes them uniquely valuable for a range of 4D understanding tasks. However, most existing methods operate in the spatiotemporal domain, where the underlying geometric characteristics of 4D point cloud videos are hard to capture, which degrades representation learning and understanding. We address this challenge from a complementary spectral perspective. By transforming 4D point cloud videos into graph spectral signals, we decompose them into multiple frequency bands, each of which captures distinct geometric structures. Our spectral analysis reveals that the low-frequency signals capture coarser shapes, while the high-frequency signals encode finer geometric details. Building on these observations, we design the Spatio-Temporal-Spectral Mixer (STS-Mixer), a unified framework that mixes spatial, temporal, and spectral representations of point cloud videos. STS-Mixer integrates the multi-band spectral signals with spatiotemporal information to capture rich geometry and temporal dynamics, enabling both fine-grained and holistic understanding of 4D point cloud videos. Extensive experiments show that STS-Mixer consistently achieves superior performance on multiple widely adopted benchmarks for both 3D action recognition and 4D semantic segmentation. Code and models are available at https://github.com/Vegetebird/STS-Mixer.
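To make the multi-band decomposition concrete, the following is a minimal sketch of a graph Fourier transform applied to a single point cloud frame. It is not the STS-Mixer implementation: the abstract does not specify how the graph is built or how bands are delimited, so the k-nearest-neighbour graph, the combinatorial Laplacian, the equal-size band split, and the function names `knn_adjacency` and `spectral_bands` are all assumptions chosen for illustration.

```python
import numpy as np


def knn_adjacency(points, k=8):
    """Symmetric, unweighted k-nearest-neighbour adjacency (assumed graph)."""
    n = points.shape[0]
    # Pairwise squared Euclidean distances between all points.
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)  # exclude self-loops
    nbrs = np.argsort(d2, axis=1)[:, :k]
    adj = np.zeros((n, n))
    adj[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    return np.maximum(adj, adj.T)  # symmetrise


def spectral_bands(points, k=8, num_bands=3):
    """Split the xyz coordinates of one frame into graph-frequency bands."""
    adj = knn_adjacency(points, k)
    lap = np.diag(adj.sum(axis=1)) - adj  # combinatorial Laplacian L = D - W
    # eigh returns eigenvalues in ascending order: low graph frequencies first.
    eigvals, eigvecs = np.linalg.eigh(lap)
    coeffs = eigvecs.T @ points  # graph Fourier transform of the coordinates
    bands = []
    # Assumed band split: equal-size groups of consecutive frequencies.
    for idx in np.array_split(np.arange(len(eigvals)), num_bands):
        # Inverse transform using only this band's eigenvectors.
        bands.append(eigvecs[:, idx] @ coeffs[idx])
    return bands


rng = np.random.default_rng(0)
frame = rng.normal(size=(64, 3))  # synthetic stand-in for one video frame
bands = spectral_bands(frame, k=8, num_bands=3)
```

Because the Laplacian eigenvectors form an orthonormal basis, the bands sum back to the original coordinates. In this sketch `bands[0]` holds the smoothest (lowest-frequency) component and `bands[-1]` the highest-frequency residual; for a video, the same routine could be applied to each frame in turn.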