🤖 AI Summary
To address weak cross-environment generalization in first-person (egocentric) action recognition, this paper proposes SeqDG, a sequence-level domain generalization method. The approach tackles the problem by (1) introducing a vision-text sequence reconstruction objective (SeqRec) that uses contextual cues from surrounding textual and visual inputs to reconstruct the central action of a sequence, exploiting the consistency of user intent across visual domains; and (2) training on mixed sequences of actions drawn from different domains (SeqMix) to improve robustness to unseen environments. Under the EPIC-KITCHENS-100 cross-domain setting, SeqDG achieves a +2.4% relative average improvement over prior methods, and on EGTEA it surpasses the state of the art by +0.6% Top-1 accuracy in intra-domain recognition. These results demonstrate substantially improved generalization to previously unseen environments.
📄 Abstract
Recognizing human activities from visual inputs, particularly through a first-person viewpoint, is essential for enabling robots to replicate human behavior. Egocentric vision, characterized by cameras worn by observers, captures diverse changes in illumination, viewpoint, and environment. This variability leads to a notable drop in the performance of Egocentric Action Recognition models when tested in environments not seen during training. In this paper, we tackle these challenges by proposing a domain generalization approach for Egocentric Action Recognition. Our insight is that action sequences often reflect consistent user intent across visual domains. By leveraging action sequences, we aim to enhance the model's generalization ability across unseen environments. Our proposed method, named SeqDG, introduces a visual-text sequence reconstruction objective (SeqRec) that uses contextual cues from both text and visual inputs to reconstruct the central action of the sequence. Additionally, we enhance the model's robustness by training it on mixed sequences of actions from different domains (SeqMix). We validate SeqDG on the EGTEA and EPIC-KITCHENS-100 datasets. Results on EPIC-KITCHENS-100 show that SeqDG leads to a +2.4% relative average improvement in cross-domain action recognition in unseen environments, and on EGTEA the model achieved a +0.6% Top-1 accuracy gain over SOTA in intra-domain action recognition.
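The SeqMix idea described above can be illustrated with a minimal sketch. The paper does not specify the exact mixing mechanism, so this is a hypothetical interpretation: given two labeled action sequences from different domains, steps in one sequence are replaced by same-label steps from the other, so the action-label sequence (and thus the user intent it reflects) is preserved while the visual features come from a mix of domains. All names (`seqmix`, `swap_prob`) are illustrative, not from the paper.

```python
import random


def seqmix(seq_a, seq_b, swap_prob=0.5, rng=None):
    """Hypothetical SeqMix-style augmentation sketch.

    seq_a, seq_b: lists of (feature, label) pairs, each sequence coming
    from a different domain (e.g. a different kitchen environment).
    Each step of seq_a is, with probability swap_prob, replaced by a
    feature with the SAME action label taken from seq_b. The label
    sequence of the output is identical to that of seq_a.
    """
    rng = rng or random.Random()
    # Index domain-B features by their action label.
    by_label = {}
    for feat, lab in seq_b:
        by_label.setdefault(lab, []).append(feat)
    mixed = []
    for feat, lab in seq_a:
        # Swap only when domain B has a segment with the same label.
        if lab in by_label and rng.random() < swap_prob:
            feat = rng.choice(by_label[lab])
        mixed.append((feat, lab))
    return mixed
```

A usage example: mixing a "cut, wash" sequence from domain A with a domain-B sequence that contains a "cut" segment yields a sequence whose labels are unchanged but whose first feature now comes from domain B.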