🤖 AI Summary
Existing temporal action segmentation methods struggle to explicitly model the hierarchical structure inherent in human activities. To address this limitation, this work proposes HybridTAS, a novel framework that introduces hyperbolic geometry into the denoising process of diffusion models for the first time. By fusing representations from both Euclidean and hyperbolic spaces, HybridTAS progressively refines segmentation outputs during denoising, moving from coarse-grained high-level categories to fine-grained specific actions, and thereby explicitly captures hierarchical dependencies among actions. The method achieves state-of-the-art performance on three standard benchmarks (GTEA, 50Salads, and Breakfast), demonstrating the effectiveness of hyperbolic-guided denoising for temporal action segmentation.
📝 Abstract
Temporal action segmentation is a critical task in video understanding, where the goal is to assign an action label to each frame in a video. While recent advances leverage iterative refinement-based strategies, they fail to explicitly utilize the hierarchical nature of human actions. In this work, we propose HybridTAS, a novel framework that incorporates a hybrid of Euclidean and hyperbolic geometries into the denoising process of diffusion models to exploit the hierarchical structure of actions. Hyperbolic geometry naturally represents tree-like relationships between embeddings, enabling us to guide the action label denoising process in a coarse-to-fine manner: higher diffusion timesteps are influenced by abstract, high-level action categories (root nodes), while lower timesteps are refined using fine-grained action classes (leaf nodes). Extensive experiments on three benchmark datasets, GTEA, 50Salads, and Breakfast, demonstrate that our method achieves state-of-the-art performance, validating the effectiveness of hyperbolic-guided denoising for the temporal action segmentation task.
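The coarse-to-fine idea can be illustrated with a small sketch. This is not the HybridTAS implementation, since the abstract does not specify how the guidance is computed. It assumes embeddings live in the unit Poincaré ball, uses the standard closed-form Poincaré distance, and blends coarse and fine class proximity with a hypothetical linear timestep schedule. The names `coarse_weight`, `guidance_scores`, the prototype dictionaries, and the class labels are all illustrative assumptions.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # Standard closed form: arcosh(1 + 2|u - v|^2 / ((1 - |u|^2)(1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def coarse_weight(t, num_steps):
    """Hypothetical linear schedule: 1.0 at the noisiest timestep, 0.0 at t = 0."""
    return t / (num_steps - 1)


def guidance_scores(frame_emb, coarse_protos, fine_protos, fine_to_coarse, t, num_steps):
    """Score each fine class by blending distance to its parent and to itself.

    Higher timesteps weight the coarse (root-like) prototype more heavily;
    lower timesteps weight the fine (leaf-like) prototype more heavily.
    """
    w = coarse_weight(t, num_steps)
    scores = {}
    for name, proto in fine_protos.items():
        d_fine = poincare_distance(frame_emb, proto)
        d_coarse = poincare_distance(frame_emb, coarse_protos[fine_to_coarse[name]])
        # Negative blended distance, so a larger score means a better match.
        scores[name] = -(w * d_coarse + (1.0 - w) * d_fine)
    return scores


# Toy hierarchy (assumed): coarse classes near the origin, fine classes near the boundary.
coarse_protos = {"prepare": (0.1, 0.0), "serve": (-0.1, 0.0)}
fine_protos = {"cut": (0.7, 0.1), "peel": (0.7, -0.1), "plate": (-0.7, 0.0)}
fine_to_coarse = {"cut": "prepare", "peel": "prepare", "plate": "serve"}
frame_emb = (0.6, 0.08)

scores_noisy = guidance_scores(frame_emb, coarse_protos, fine_protos, fine_to_coarse, t=9, num_steps=10)
scores_clean = guidance_scores(frame_emb, coarse_protos, fine_protos, fine_to_coarse, t=0, num_steps=10)
```

In this toy setup, the noisiest timestep cannot distinguish `cut` from `peel` because both share the `prepare` parent, while the final timestep separates them using the fine prototypes. A learned model would replace the fixed prototypes and the linear schedule, both of which are placeholders here.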