🤖 AI Summary
This paper addresses keystep recognition in first-person videos by proposing a dynamic heterogeneous graph-based node classification paradigm. Methodologically: (1) it constructs a sparse temporal graph in which video segments serve as nodes, integrates third-person video alignment supervision with automatic caption semantics, and models captions as scalable semantic nodes; (2) it introduces a cross-perspective (ego/exo) feature alignment mechanism and a multi-source graph fusion strategy to strengthen long-range temporal dependency modeling. Key contributions include: the first formalization of keystep recognition as dynamic heterogeneous graph node classification; a cross-perspective alignment-enhanced training framework; and the incorporation of lightweight caption-derived semantic nodes to improve generalization. Evaluated on the Ego-Exo4D dataset, the method achieves state-of-the-art performance with fewer parameters, faster inference, and stronger long-range temporal modeling than existing approaches.
📝 Abstract
We pose keystep recognition as a node classification task, and propose a flexible graph-learning framework for fine-grained keystep recognition that is able to effectively leverage long-term dependencies in egocentric videos. Our approach, termed GLEVR, consists of constructing a graph where each video clip of the egocentric video corresponds to a node. The constructed graphs are sparse and computationally efficient, outperforming existing larger models substantially. We further leverage alignment between egocentric and exocentric videos during training for improved inference on egocentric videos, as well as adding automatic captioning as an additional modality. We consider each clip of each exocentric video (if available) or video captions as additional nodes during training. We examine several strategies to define connections across these nodes. We perform extensive experiments on the Ego-Exo4D dataset and show that our proposed flexible graph-based framework notably outperforms existing methods.
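The graph construction described above can be illustrated with a minimal sketch: each egocentric clip becomes a node, neighboring clips within a small temporal window are linked (keeping the graph sparse), and caption-derived nodes attach to the clips they describe. The function name, the `window` parameter, and the caption-to-clip mapping below are illustrative assumptions, not details from the paper.

```python
def build_clip_graph(num_clips, window=2, captions=None):
    """Build an undirected adjacency list for a sparse clip graph.

    Clip nodes are 0..num_clips-1; caption nodes (if any) are appended
    after them. `captions` is a list where each entry is the list of
    clip indices that one caption node connects to (an assumed format).
    """
    edges = set()
    # Sparse temporal edges: link each clip to neighbors within `window`.
    for i in range(num_clips):
        for j in range(i + 1, min(i + window + 1, num_clips)):
            edges.add((i, j))
    # Caption nodes act as extra semantic nodes linked to their clips.
    captions = captions or []
    for k, clip_ids in enumerate(captions):
        cap_node = num_clips + k
        for c in clip_ids:
            edges.add((c, cap_node))
    total = num_clips + len(captions)
    adj = {n: [] for n in range(total)}
    for a, b in sorted(edges):
        adj[a].append(b)
        adj[b].append(a)
    return adj

# 5 clips, window of 1, two caption nodes covering clips {0,1} and {3,4}.
adj = build_clip_graph(5, window=1, captions=[[0, 1], [3, 4]])
```

Keystep recognition then amounts to node classification on this structure; exocentric clips, when available at training time, would be added as further node types in the same way.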