🤖 AI Summary
This work addresses the poor generalization of temporal action segmentation models to unseen camera viewpoints (e.g., third-person frontal → lateral, or exocentric → egocentric). To evaluate cross-view generalization systematically, it introduces the first dedicated benchmark protocol for cross-view action segmentation. The proposed multi-granularity framework learns a shared representation at both the sequence and segment levels, enforcing cross-view semantic consistency via a sequence alignment loss and a contrastive action-level loss. It further incorporates cross-view consistency regularization and hierarchical temporal modeling to decouple video representations from viewpoint and align them with action semantics. Evaluated on Assembly101, IKEA ASM, and EgoExoLearn, the method achieves a +12.8% F1@50 improvement on unseen exocentric views and +54.0% on unseen egocentric views, marking substantial progress toward viewpoint-agnostic action segmentation.
📝 Abstract
While there has been substantial progress in temporal action segmentation, the challenge of generalizing to unseen views remains unaddressed. Hence, we define a protocol for unseen-view action segmentation in which the camera views used to evaluate the model are unavailable during training. This includes shifting from top-frontal views to a side view or, even more challenging, from exocentric to egocentric views. Furthermore, we present an approach to temporal action segmentation that tackles this challenge. Our approach leverages a shared representation at both the sequence and segment levels to reduce the impact of view differences during training. We achieve this by introducing a sequence loss and an action loss, which together enforce consistent video and action representations across different views. Evaluations on the Assembly101, IKEA ASM, and EgoExoLearn datasets demonstrate significant improvements: a 12.8% increase in F1@50 for unseen exocentric views and a substantial 54% improvement for unseen egocentric views.
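The abstract names the two consistency objectives but not their exact form. Below is a minimal, hypothetical PyTorch sketch of how such losses are commonly instantiated; the function names, tensor shapes, cosine-distance alignment, and supervised-contrastive formulation are assumptions for illustration, not the paper's actual definitions.

```python
import torch
import torch.nn.functional as F

def sequence_loss(frames_view_a: torch.Tensor, frames_view_b: torch.Tensor) -> torch.Tensor:
    """Align per-frame features of the same (time-synchronized) video seen
    from two views by minimizing their mean cosine distance.  (Assumed form.)"""
    a = F.normalize(frames_view_a, dim=-1)  # (T, D) per-frame embeddings
    b = F.normalize(frames_view_b, dim=-1)  # (T, D)
    return (1.0 - (a * b).sum(dim=-1)).mean()

def action_loss(seg_feats: torch.Tensor, seg_labels: torch.Tensor,
                temperature: float = 0.07) -> torch.Tensor:
    """Supervised-contrastive loss over pooled segment embeddings: segments of
    the same action class (from any view) are positives, the rest negatives.
    (Assumed form; the paper's contrastive action-level loss may differ.)"""
    z = F.normalize(seg_feats, dim=-1)                 # (N, D) segment embeddings
    sim = z @ z.t() / temperature                      # (N, N) pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = (seg_labels.unsqueeze(0) == seg_labels.unsqueeze(1)) & ~self_mask
    # Row-wise log-softmax, excluding each segment's similarity with itself.
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    return -(log_prob * pos_mask).sum(dim=1).div(pos_counts).mean()

# Hypothetical total objective: the standard segmentation loss plus the two
# consistency terms, with assumed weights lambda_seq and lambda_act:
#   loss = seg_task_loss + lambda_seq * sequence_loss(fa, fb) \
#          + lambda_act * action_loss(z, y)
```

In this reading, the sequence loss pulls whole-video representations of the two views together, while the action loss operates at segment granularity so that the same action looks alike regardless of viewpoint, matching the multi-granularity design described above.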