Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context

📅 2024-10-28
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
To address the neglect of hierarchical structure and sequential context in action recognition, this paper proposes a hierarchy-aware multimodal Transformer architecture integrating visual (RGB/optical flow) and textual (spatial location, preceding actions) features. Methodologically, it explicitly models coarse-to-fine-grained action hierarchies; incorporates realistic textual contextual cues; and introduces a hierarchical joint loss function to jointly optimize two-level classification. Contributions include: (1) constructing Hierarchical TSU, the first benchmark dataset with fine-grained hierarchical action annotations; and (2) establishing a context-aware action modeling paradigm. Experiments show that the method outperforms pretrained SOTA baselines trained with identical hyperparameters; relative to the equivalent fine-grained RGB model, top-1 accuracy improves by 17.12% with ground-truth context and by 5.33% with predicted context.
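The hierarchical joint loss described above can be sketched as a weighted sum of a coarse-level and a fine-level cross-entropy term. This is a minimal NumPy illustration, not the paper's implementation; the weighting parameter `alpha` and the function names are assumptions for illustration only.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, target):
    # negative log-likelihood of the target class
    probs = softmax(logits)
    return -np.log(probs[target])

def hierarchical_joint_loss(coarse_logits, fine_logits,
                            coarse_target, fine_target, alpha=0.5):
    # weighted sum of the coarse- and fine-grained classification
    # losses; alpha (a hypothetical choice here) balances the two
    # levels of the action hierarchy
    l_coarse = cross_entropy(coarse_logits, coarse_target)
    l_fine = cross_entropy(fine_logits, fine_target)
    return alpha * l_coarse + (1.0 - alpha) * l_fine
```

Training both heads against one scalar loss lets gradients from the coarse level regularize the fine-grained classifier, which is the intuition behind exploiting the action hierarchy.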

πŸ“ Abstract
The sequential execution of actions, and their hierarchical structure consisting of different levels of abstraction, provide features that remain unexplored in the task of action recognition. In this study, we present a novel approach to improve action recognition by exploiting the hierarchical organization of actions and by incorporating contextualized textual information, including location and prior actions, to reflect the sequential context. To achieve this goal, we introduce a novel transformer architecture tailored for action recognition that utilizes both visual and textual features. Visual features are obtained from RGB and optical flow data, while text embeddings represent contextual information. Furthermore, we define a joint loss function to simultaneously train the model for both coarse and fine-grained action recognition, thereby exploiting the hierarchical nature of actions. To demonstrate the effectiveness of our method, we extend the Toyota Smarthome Untrimmed (TSU) dataset with action hierarchies, introducing the Hierarchical TSU dataset. We also conduct an ablation study to assess the impact of different methods for integrating contextual and hierarchical data on action recognition performance. Results show that the proposed approach outperforms pre-trained SOTA methods when trained with the same hyperparameters. Moreover, results show a 17.12% improvement in top-1 accuracy over the equivalent fine-grained RGB version when using ground-truth contextual information, and a 5.33% improvement when contextual information is obtained from actual predictions.
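One common way to feed a transformer both visual and textual features, as the abstract describes, is to project each modality into a shared model dimension and concatenate the results into a single token sequence. The sketch below assumes this fusion strategy; all dimensions, the random projection weights, and the two-token text context (location plus prior action) are hypothetical and stand in for learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(x, w):
    # linear projection of modality features to the shared model dimension
    return x @ w

d_model = 64
rgb  = rng.standard_normal((8, 512))   # 8 segments of RGB features
flow = rng.standard_normal((8, 512))   # matching optical-flow features
text = rng.standard_normal((2, 300))   # location + prior-action embeddings

w_vis  = rng.standard_normal((512, d_model)) * 0.01
w_text = rng.standard_normal((300, d_model)) * 0.01

# fuse by projecting each modality and concatenating along the token
# axis, yielding one sequence a transformer encoder could attend over
tokens = np.concatenate(
    [project(rgb, w_vis), project(flow, w_vis), project(text, w_text)],
    axis=0,
)
```

Because self-attention operates over the whole concatenated sequence, visual tokens can attend to the textual context tokens directly, which is one plausible mechanism for the context-driven gains reported above.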
Problem

Research questions and friction points this paper is trying to address.

Improving action recognition using hierarchical action structures
Incorporating contextual text information for temporal understanding
Developing transformer architecture with visual and textual features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer architecture integrates visual and textual features
Joint loss function trains coarse- and fine-grained recognition
Hierarchical dataset extends action recognition with contextual data