Hierarchical Vector Quantization for Unsupervised Action Segmentation

📅 2024-12-23
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses unsupervised temporal action segmentation: partitioning a set of long, untrimmed videos into semantically meaningful segments that are consistent across videos. Prior approaches that combine representation learning and clustering in a single step do not cope with large variations within temporal segments of the same class. The proposed Hierarchical Vector Quantization (HVQ) method stacks two consecutive vector quantization modules, yielding a hierarchical clustering in which the additional subclusters cover the variations within a cluster. The authors also introduce a metric based on the Jensen-Shannon Distance (JSD) to assess how well a method captures the distribution of segment lengths. On the Breakfast, YouTube Instructional, and IKEA ASM datasets, HVQ outperforms the state of the art in terms of F1 score, recall, and JSD.

📝 Abstract
In this work, we address unsupervised temporal action segmentation, which segments a set of long, untrimmed videos into semantically meaningful segments that are consistent across videos. While recent approaches combine representation learning and clustering in a single step for this task, they do not cope with large variations within temporal segments of the same class. To address this limitation, we propose a novel method, termed Hierarchical Vector Quantization (HVQ), that consists of two subsequent vector quantization modules. This results in a hierarchical clustering where the additional subclusters cover the variations within a cluster. We demonstrate that our approach captures the distribution of segment lengths much better than the state of the art. To this end, we introduce a new metric based on the Jensen-Shannon Distance (JSD) for unsupervised temporal action segmentation. We evaluate our approach on three public datasets, namely Breakfast, YouTube Instructional and IKEA ASM. Our approach outperforms the state of the art in terms of F1 score, recall and JSD.
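To make the idea of two consecutive vector quantization modules more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that frame embeddings are first assigned to the nearest codeword of a larger subcluster codebook, and that the resulting quantized vectors are then assigned to the nearest codeword of a smaller cluster codebook. The function names, codebook sizes, and the random codebooks are hypothetical placeholders; encoder training and codebook updates are omitted entirely.

```python
import numpy as np

def quantize(x, codebook):
    """Assign each row of x to its nearest codeword (squared Euclidean distance)."""
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

def hierarchical_quantize(frames, sub_codebook, top_codebook):
    """Two-level quantization: frames -> subclusters -> clusters (assumed ordering)."""
    # Level 1: fine-grained subclusters meant to absorb within-action variation.
    sub_idx, sub_vecs = quantize(frames, sub_codebook)
    # Level 2: quantize the level-1 codewords into coarser clusters.
    top_idx, _ = quantize(sub_vecs, top_codebook)
    return sub_idx, top_idx

# Hypothetical toy data: 200 frame embeddings of dimension 16.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 16))
sub_codebook = rng.normal(size=(12, 16))  # 12 subclusters (illustrative size)
top_codebook = rng.normal(size=(4, 16))   # 4 clusters (illustrative size)

sub_idx, top_idx = hierarchical_quantize(frames, sub_codebook, top_codebook)
```

Because the second level quantizes the first-level codewords rather than the raw frames, every frame in a given subcluster receives the same cluster label, which is what makes the clustering hierarchical in this sketch.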
Problem

Research questions and friction points this paper is trying to address.

Video Segmentation
Action Recognition
Accuracy Improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Vector Quantization
Jensen-Shannon Distance
Video Segmentation
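
As an illustration of the Jensen-Shannon distance listed above, the sketch below compares two histograms of segment lengths. It is only a sketch under stated assumptions: the abstract says the metric is based on the Jensen-Shannon distance and concerns the distribution of segment lengths, but the binning, the aggregation over videos, and the helper names `segment_lengths` and `js_distance` are hypothetical choices made here for illustration.

```python
import numpy as np

def segment_lengths(labels):
    """Lengths of maximal runs of identical labels in a 1-D label sequence."""
    labels = np.asarray(labels)
    # Indices where the label changes mark the start of a new segment.
    change = np.flatnonzero(labels[1:] != labels[:-1]) + 1
    bounds = np.concatenate(([0], change, [len(labels)]))
    return np.diff(bounds)

def js_distance(p, q):
    """Jensen-Shannon distance (square root of the divergence, log base 2)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(np.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0)))

# Hypothetical frame-wise labels for one short video.
ground_truth = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
over_segmented = [0, 1, 0, 1, 0, 1, 2, 2, 2, 2]

bins = np.arange(1, 12)  # shared bin edges for both histograms (illustrative)
gt_hist, _ = np.histogram(segment_lengths(ground_truth), bins=bins)
pred_hist, _ = np.histogram(segment_lengths(over_segmented), bins=bins)

distance = js_distance(gt_hist, pred_hist)
```

With base-2 logarithms the distance lies between 0 (identical length distributions) and 1 (distributions with no overlap), so a prediction that fragments actions into many short segments scores worse than one whose segment lengths resemble the reference.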