RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba

📅 2025-10-18
🤖 AI Summary
The RAVAR task aims to precisely localize a target person and recognize their fine-grained atomic actions in complex multi-person videos, conditioned on natural language descriptions; its core challenges lie in cross-modal alignment and joint person-action localization. To address these, the authors propose a novel cross-modal modeling paradigm: (1) a multi-level semantic-aligned cross-attention mechanism; (2) a semantic-retrieval-guided multi-trajectory Mamba model that dynamically constructs keyword- and scene-attribute-driven spatial scanning paths; and (3) cross-modal token aggregation with dynamic spatial token selection, integrating word-level, attribute-level, and sentence-level information. Evaluated on RefAVA++, the method achieves new state-of-the-art performance, significantly outperforming existing approaches. This work advances referring-expression-driven video action recognition by enabling robust, interpretable, and semantically grounded spatiotemporal localization.
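The multi-level semantic-aligned cross-attention can be pictured as text queries at three granularities (keyword, scene attribute, full sentence) each attending over the same pool of visual spatial tokens, with the per-level summaries then fused. The sketch below is an illustrative assumption, not the authors' implementation; function names, shapes, and the simple mean fusion are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_cross_attention(visual, word_emb, attr_emb, sent_emb):
    """Aggregate visual tokens under word-, attribute-, and sentence-level queries.

    visual:   (n, d) visual spatial token embeddings
    word_emb: (k, d) partial-keyword embeddings
    attr_emb: (a, d) scene-attribute embeddings
    sent_emb: (1, d) holistic-sentence embedding
    Returns a single (d,) fused visual summary across the three semantic levels.
    """
    scale = np.sqrt(visual.shape[-1])
    level_summaries = []
    for text in (word_emb, attr_emb, sent_emb):
        # Each text query attends over all visual tokens (scaled dot-product).
        attn = softmax(text @ visual.T / scale, axis=-1)   # (q, n)
        # Average the per-query attended features into one vector per level.
        level_summaries.append((attn @ visual).mean(axis=0))  # (d,)
    # Fuse the three semantic levels (here: a plain mean, as a placeholder).
    return np.stack(level_summaries).mean(axis=0)
```

In the actual model the fusion would be learned rather than a mean, but the key idea survives: the same visual tokens are re-weighted differently at each semantic hierarchy before aggregation.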

📝 Abstract
Referring Atomic Video Action Recognition (RAVAR) aims to recognize fine-grained, atomic-level actions of a specific person of interest conditioned on natural language descriptions. Distinct from conventional action recognition and detection tasks, RAVAR emphasizes precise language-guided action understanding, which is particularly critical for interactive human action analysis in complex multi-person scenarios. In this work, we extend our previously introduced RefAVA dataset to RefAVA++, which comprises >2.9 million frames and >75.1k annotated persons in total. We benchmark this dataset using baselines from multiple related domains, including atomic action localization, video question answering, and text-video retrieval, as well as our earlier model, RefAtomNet. Although RefAtomNet surpasses other baselines by incorporating agent attention to highlight salient features, its ability to align and retrieve cross-modal information remains limited, leading to suboptimal performance in localizing the target person and predicting fine-grained actions. To overcome the aforementioned limitations, we introduce RefAtomNet++, a novel framework that advances cross-modal token aggregation through a multi-hierarchical semantic-aligned cross-attention mechanism combined with multi-trajectory Mamba modeling at the partial-keyword, scene-attribute, and holistic-sentence levels. In particular, scanning trajectories are constructed by dynamically selecting the nearest visual spatial tokens at each timestep for both partial-keyword and scene-attribute levels. Moreover, we design a multi-hierarchical semantic-aligned cross-attention strategy, enabling more effective aggregation of spatial and temporal tokens across different semantic hierarchies. Experiments show that RefAtomNet++ establishes new state-of-the-art results. The dataset and code are released at https://github.com/KPeng9510/refAVA2.
Problem

Research questions and friction points this paper is trying to address.

Recognizing fine-grained atomic actions of specific persons using language descriptions
Improving cross-modal alignment between visual content and textual queries
Enhancing person localization and action prediction in complex multi-person scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-hierarchical semantic-aligned cross-attention for token aggregation
Multi-trajectory Mamba modeling across partial-keyword and scene-attribute levels
Dynamic selection of nearest visual tokens for scanning trajectories
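The trajectory-construction idea in the last bullet can be sketched as a greedy nearest-neighbor ordering: start from the visual token most similar to a keyword or scene-attribute query, then at each timestep hop to the nearest not-yet-visited token, yielding a semantics-driven scan order for sequential (Mamba-style) processing. This is a minimal sketch under that assumed reading of "nearest"; the function name and distance choice are illustrative, not taken from the paper.

```python
import numpy as np

def build_scan_trajectory(query, tokens):
    """Order visual spatial tokens into a query-driven scanning path.

    query:  (d,) embedding of one partial keyword or scene attribute
    tokens: (n, d) visual spatial token embeddings
    Returns a list of token indices: the order in which a sequential
    state-space model would consume the tokens for this trajectory.
    """
    n = tokens.shape[0]
    visited = np.zeros(n, dtype=bool)
    # Anchor the trajectory at the token most similar to the query.
    start = int(np.argmax(tokens @ query))
    order = [start]
    visited[start] = True
    # Greedily extend: hop to the nearest unvisited token in embedding space.
    for _ in range(n - 1):
        dists = np.linalg.norm(tokens - tokens[order[-1]], axis=1)
        dists[visited] = np.inf
        nxt = int(np.argmin(dists))
        order.append(nxt)
        visited[nxt] = True
    return order
```

Running one such trajectory per keyword and per scene attribute would give the multiple scanning paths that the multi-trajectory Mamba branches then process in parallel.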