Keystep Recognition using Graph Neural Networks

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses keystep recognition in first-person videos by proposing a novel dynamic heterogeneous graph-based node classification paradigm. Methodologically: (1) it constructs a sparse temporal graph in which video segments serve as nodes, integrates third-person video alignment supervision with automatic caption semantics, and models captions as scalable semantic nodes; (2) it introduces a cross-perspective (ego/exo) feature alignment mechanism and a multi-source graph fusion strategy to enhance long-range temporal dependency modeling. Key contributions include: the first formalization of keystep recognition as dynamic heterogeneous graph node classification; a cross-perspective alignment-enhanced training framework; and the incorporation of lightweight caption-derived semantic nodes to improve generalization. Evaluated on the Ego-Exo4D dataset, the method achieves state-of-the-art performance with fewer parameters, faster inference, and stronger long-range temporal modeling than existing approaches.

📝 Abstract
We pose keystep recognition as a node classification task, and propose a flexible graph-learning framework for fine-grained keystep recognition that is able to effectively leverage long-term dependencies in egocentric videos. Our approach, termed GLEVR, consists of constructing a graph where each video clip of the egocentric video corresponds to a node. The constructed graphs are sparse and computationally efficient, outperforming existing larger models substantially. We further leverage alignment between egocentric and exocentric videos during training for improved inference on egocentric videos, as well as adding automatic captioning as an additional modality. We consider each clip of each exocentric video (if available) or video captions as additional nodes during training. We examine several strategies to define connections across these nodes. We perform extensive experiments on the Ego-Exo4D dataset and show that our proposed flexible graph-based framework notably outperforms existing methods.
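The graph construction the abstract describes (one node per egocentric clip, a sparse set of temporal connections, plus extra nodes for captions during training) can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the fixed temporal window used for sparsification and the `caption_of_clip` mapping are assumptions made here for the example.

```python
def build_temporal_graph(num_clips, window=2):
    """Sparse temporal graph: each clip node connects (bidirectionally)
    only to clips within `window` steps, keeping the edge set small."""
    edges = []
    for i in range(num_clips):
        for j in range(i + 1, min(i + window + 1, num_clips)):
            edges.append((i, j))
            edges.append((j, i))
    return edges


def add_caption_nodes(edges, num_clips, caption_of_clip):
    """Attach caption nodes (ids offset by num_clips) to their source
    clips, mimicking the extra semantic nodes used during training."""
    new_edges = list(edges)
    for clip, cap in caption_of_clip.items():
        cap_node = num_clips + cap
        new_edges.append((clip, cap_node))
        new_edges.append((cap_node, clip))
    return new_edges
```

With `num_clips=5` and `window=2`, the clip-only graph has 14 directed edges rather than the 20 of a fully connected graph, and the gap widens quickly as the video gets longer, which is where the computational savings come from.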
Problem

Research questions and friction points this paper is trying to address.

Recognize keysteps in egocentric videos using graph neural networks
Leverage long-term dependencies and multi-modal data for improved recognition
Outperform existing methods with sparse, efficient graph-based framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph neural networks for keystep recognition
Sparse graph construction for efficiency
Multimodal training with video alignment
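The node-classification idea behind these bullets can be illustrated with a toy message-passing step followed by a readout that labels each clip node. This is a hedged sketch in plain Python, not GLEVR's actual network: the mean aggregation and the nearest-prototype classifier are stand-ins for learned GNN layers and a trained classification head.

```python
def propagate(features, edges, steps=2):
    """Average each node's feature vector with its neighbors'
    (a GCN-style mean aggregation, repeated `steps` times)."""
    for _ in range(steps):
        nxt = []
        for i, feat in enumerate(features):
            neigh = [features[j] for (a, j) in edges if a == i]
            pooled = [feat] + neigh
            nxt.append([sum(vals) / len(pooled) for vals in zip(*pooled)])
        features = nxt
    return features


def classify(features, prototypes):
    """Label each node with the index of the nearest keystep
    prototype (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(prototypes)), key=lambda k: dist(f, prototypes[k]))
            for f in features]
```

After propagation, each clip node's feature mixes information from temporally nearby clips, which is how a graph of this shape captures long-range context without attending over every pair of clips.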