🤖 AI Summary
This work addresses the challenges of segment length bias and insufficient spatiotemporal modeling in unsupervised skeleton-based action segmentation by proposing a hierarchical spatiotemporal vector quantization framework. The method employs a two-stage vector quantization process: it first maps raw skeleton sequences to fine-grained sub-action units, then aggregates these units into action-level representations. By jointly reconstructing both the skeleton data and their corresponding timestamps, the model enables end-to-end unsupervised spatiotemporal learning. To the best of our knowledge, this is the first effort to introduce hierarchical vector quantization to this task, effectively integrating spatial and temporal cues and substantially mitigating segment length bias. The approach achieves state-of-the-art performance across multiple benchmarks, including HuGaDB, LARa, and BABEL, and significantly outperforms the non-hierarchical baseline.
📝 Abstract
We propose a novel hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. We first introduce a hierarchical approach with two consecutive levels of vector quantization. Specifically, the lower level associates skeletons with fine-grained subactions, while the higher level aggregates subactions into action-level representations. This hierarchical approach outperforms the non-hierarchical baseline, although it primarily exploits spatial cues by reconstructing input skeletons. Next, we extend our approach to leverage both spatial and temporal information, yielding a hierarchical spatiotemporal vector quantization scheme. In particular, our hierarchical spatiotemporal approach performs multi-level clustering while simultaneously recovering input skeletons and their corresponding timestamps. Lastly, extensive experiments on multiple benchmarks, including HuGaDB, LARa, and BABEL, demonstrate that our approach establishes new state-of-the-art performance and reduces segment length bias in unsupervised skeleton-based temporal action segmentation.
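To make the two-level quantization and the joint skeleton/timestamp reconstruction objective more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the encoder, decoder, codebook update rule, distance metric, or loss weighting. The function names (`nearest_code`, `hierarchical_quantize`, `joint_reconstruction_loss`), the use of squared Euclidean distance, the mean-squared-error loss, and the `time_weight` parameter are all hypothetical choices made for this example.

```python
import numpy as np


def nearest_code(x, codebook):
    """Return, for each row of x, the index of the closest codebook entry.

    Assumption: closeness is squared Euclidean distance.
    x has shape (T, D); codebook has shape (K, D).
    """
    # Pairwise squared distances, shape (T, K).
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def hierarchical_quantize(features, sub_codebook, action_codebook):
    """Two consecutive levels of vector quantization (hypothetical sketch).

    Lower level: each per-frame feature is assigned to a subaction code.
    Higher level: each selected subaction code is assigned to an action code.
    """
    sub_idx = nearest_code(features, sub_codebook)
    sub_vecs = sub_codebook[sub_idx]
    act_idx = nearest_code(sub_vecs, action_codebook)
    act_vecs = action_codebook[act_idx]
    return sub_idx, act_idx, act_vecs


def joint_reconstruction_loss(skel, skel_hat, t, t_hat, time_weight=1.0):
    """Spatial plus temporal reconstruction error.

    Assumption: both terms are mean squared errors, combined with a
    hypothetical scalar weight on the timestamp term.
    """
    spatial = np.mean((skel - skel_hat) ** 2)
    temporal = np.mean((t - t_hat) ** 2)
    return spatial + time_weight * temporal
```

In this sketch, `act_idx` plays the role of the per-frame action label, so a segmentation would be read off as runs of identical indices. Setting `time_weight=0` reduces the loss to the spatial-only variant, which loosely mirrors the distinction the abstract draws between the purely hierarchical and the hierarchical spatiotemporal schemes.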