🤖 AI Summary
Existing egocentric action recognition methods decouple verb (action) and noun (object) classification into independent tasks, neglecting their semantic interdependence and resulting in fragmented representations and limited generalization. This paper proposes a unified prompt-learning framework that constructs an interactive, shared prompt pool. It achieves fine-grained verb–noun semantic synergy through pattern-level decomposition and attention-based fusion. To encourage contextual interaction and feature disentanglement, the framework further introduces prompt selection frequency regularization and knowledge orthogonality constraints. The method achieves state-of-the-art performance on Ego4D, EPIC-Kitchens, and EGTEA. Notably, it shows clear gains in cross-dataset generalization and zero-shot recognition of unseen categories, outperforming prior approaches by substantial margins. The framework bridges the semantic gap between actions and objects while preserving modularity and scalability, offering a principled solution to holistic egocentric action understanding.
📝 Abstract
Driven by the increasing demand for applications in augmented and virtual reality, egocentric action recognition has emerged as a prominent research area. It is typically divided into two subtasks: recognizing the performed behavior (i.e., the verb component) and identifying the objects being acted upon (i.e., the noun component) from the first-person perspective. However, most existing approaches treat these two components as independent classification tasks, focusing on extracting component-specific knowledge while overlooking their inherent semantic and contextual relationships, leading to fragmented representations and sub-optimal generalization. To address these challenges, we propose a prompt learning-based framework, EgoPrompt, for egocentric action recognition. Building on existing prompting strategies that capture component-specific knowledge, we construct a Unified Prompt Pool space to establish interaction between the two types of component representations. Specifically, the component representations (from verbs and nouns) are first decomposed into fine-grained patterns in the form of prompt pairs. These pattern-level representations are then fused through an attention-based mechanism to facilitate cross-component interaction. To ensure the prompt pool is informative, we further introduce a novel training objective, the Diverse Pool Criteria, which realizes this goal from two perspectives: Prompt Selection Frequency Regularization and Prompt Knowledge Orthogonalization. Extensive experiments are conducted on the Ego4D, EPIC-Kitchens, and EGTEA datasets. The results consistently show that EgoPrompt achieves state-of-the-art performance across within-dataset, cross-dataset, and base-to-novel generalization benchmarks.
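The mechanism described above can be illustrated with a minimal sketch. This is not the paper's implementation; the pool size, dimensions, and loss forms below are illustrative assumptions. It shows (a) a shared prompt pool queried by verb and noun features via softmax attention, (b) an orthogonality penalty that discourages redundant prompts, and (c) a selection-frequency penalty that discourages any single prompt from dominating:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 8, 16                       # pool size and pattern dim (illustrative)
pool = rng.standard_normal((K, d)) # shared prompt pool, one row per pattern

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse(feature, pool):
    """Attention-based fusion: a component feature queries the shared pool."""
    attn = softmax(pool @ feature / np.sqrt(d))  # selection weights, shape (K,)
    return attn @ pool, attn                     # fused pattern representation

verb_feat = rng.standard_normal(d)
noun_feat = rng.standard_normal(d)
verb_ctx, w_v = fuse(verb_feat, pool)
noun_ctx, w_n = fuse(noun_feat, pool)

# Knowledge-orthogonality sketch: penalize off-diagonal overlap of the
# (row-normalized) pool Gram matrix so prompts encode distinct patterns.
P = pool / np.linalg.norm(pool, axis=1, keepdims=True)
ortho_loss = np.sum((P @ P.T - np.eye(K)) ** 2)

# Selection-frequency sketch: push average selection weights toward
# uniform so every prompt in the pool stays in use.
avg_sel = (w_v + w_n) / 2
freq_loss = np.sum((avg_sel - 1.0 / K) ** 2)
```

In a real model these two penalties would be weighted and added to the classification loss; here they are computed once on random features purely to show their shape.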