🤖 AI Summary
This work addresses the poor generalization of temporal action segmentation models to unseen camera viewpoints (e.g., third-person frontal → lateral, or exocentric → egocentric). To evaluate cross-view generalization systematically, it introduces the first dedicated benchmark protocol for cross-view action segmentation. The proposed multi-granularity framework learns a shared representation at both the sequence and segment levels, enforcing cross-view semantic consistency via a sequence alignment loss and a contrastive action-level loss. It further incorporates cross-view consistency regularization and hierarchical temporal modeling to decouple video representations from viewpoint and align them with action semantics. Evaluated on Assembly101, IKEA ASM, and EgoExoLearn, the method achieves a +12.8% F1@50 improvement on unseen exocentric views and +54.0% on unseen egocentric views, marking substantial progress toward viewpoint-agnostic action segmentation.
📝 Abstract
While there has been substantial progress in temporal action segmentation, the challenge of generalizing to unseen views remains unaddressed. Hence, we define a protocol for unseen-view action segmentation in which the camera views used to evaluate the model are unavailable during training. This includes shifting from top-frontal views to a side view or, even more challenging, from exocentric to egocentric views. Furthermore, we present an approach to temporal action segmentation that tackles this challenge. Our approach leverages a shared representation at both the sequence and segment levels to reduce the impact of view differences during training. We achieve this by introducing a sequence loss and an action loss, which together enforce consistent video and action representations across different views. Evaluations on the Assembly101, IKEA ASM, and EgoExoLearn datasets demonstrate significant improvements: a 12.8% increase in F1@50 for unseen exocentric views and a substantial 54% improvement for unseen egocentric views.
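The abstract names the two consistency objectives but not their exact form. Below is a minimal, hypothetical PyTorch sketch of how such losses are commonly instantiated; the function names, tensor shapes, cosine-distance alignment, and supervised-contrastive formulation are assumptions for illustration, not the paper's actual definitions.

```python
import torch
import torch.nn.functional as F

def sequence_loss(frames_view_a: torch.Tensor, frames_view_b: torch.Tensor) -> torch.Tensor:
    """Align per-frame features of the same (time-synchronized) video seen
    from two views by minimizing their mean cosine distance.  (Assumed form.)"""
    a = F.normalize(frames_view_a, dim=-1)  # (T, D) per-frame embeddings
    b = F.normalize(frames_view_b, dim=-1)  # (T, D)
    return (1.0 - (a * b).sum(dim=-1)).mean()

def action_loss(seg_feats: torch.Tensor, seg_labels: torch.Tensor,
                temperature: float = 0.07) -> torch.Tensor:
    """Supervised-contrastive loss over pooled segment embeddings: segments of
    the same action class (from any view) are positives, the rest negatives.
    (Assumed form; the paper's contrastive action-level loss may differ.)"""
    z = F.normalize(seg_feats, dim=-1)                 # (N, D) segment embeddings
    sim = z @ z.t() / temperature                      # (N, N) pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = (seg_labels.unsqueeze(0) == seg_labels.unsqueeze(1)) & ~self_mask
    # Row-wise log-softmax, excluding each segment's similarity with itself.
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    return -(log_prob * pos_mask).sum(dim=1).div(pos_counts).mean()

# Hypothetical total objective: the standard segmentation loss plus the two
# consistency terms, with assumed weights lambda_seq and lambda_act:
#   loss = seg_task_loss + lambda_seq * sequence_loss(fa, fb) \
#          + lambda_act * action_loss(z, y)
```

In this reading, the sequence loss pulls whole-video representations of the two views together, while the action loss operates at segment granularity so that the same action looks alike regardless of viewpoint, matching the multi-granularity design described above.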