🤖 AI Summary
Addressing the challenge of detecting and segmenting out-of-distribution (OOD) actions in open-world action segmentation, this paper proposes the first end-to-end framework. It introduces an enhanced pyramid graph convolutional network to model multi-scale spatiotemporal dependencies; a Mixup-based unlabeled anomalous action synthesis strategy to mitigate OOD sample scarcity; and a temporal clustering loss jointly optimizing action segmentation and open-set recognition. Evaluated on Bimanual Actions and H2O datasets, our method achieves significant improvements: +16.9% in open-set segmentation F1@50 and +34.6% in OOD detection AUROC. These results demonstrate strong generalization under dynamic real-world conditions and establish a novel paradigm for applications such as assistive robotics and healthcare.
📝 Abstract
Human-object interaction segmentation is a fundamental task of daily activity understanding, which plays a crucial role in applications such as assistive robotics, healthcare, and autonomous systems. Most existing learning-based methods excel in closed-world action segmentation, they struggle to generalize to open-world scenarios where novel actions emerge. Collecting exhaustive action categories for training is impractical due to the dynamic diversity of human activities, necessitating models that detect and segment out-of-distribution actions without manual annotation. To address this issue, we formally define the open-world action segmentation problem and propose a structured framework for detecting and segmenting unseen actions. Our framework introduces three key innovations: 1) an Enhanced Pyramid Graph Convolutional Network (EPGCN) with a novel decoder module for robust spatiotemporal feature upsampling. 2) Mixup-based training to synthesize out-of-distribution data, eliminating reliance on manual annotations. 3) A novel Temporal Clustering loss that groups in-distribution actions while distancing out-of-distribution samples.
We evaluate our framework on two challenging human-object interaction recognition datasets: Bimanual Actions and 2 Hands and Object (H2O) datasets. Experimental results demonstrate significant improvements over state-of-the-art action segmentation models across multiple open-set evaluation metrics, achieving 16.9% and 34.6% relative gains in open-set segmentation (F1@50) and out-of-distribution detection performances (AUROC), respectively. Additionally, we conduct an in-depth ablation study to assess the impact of each proposed component, identifying the optimal framework configuration for open-world action segmentation.