🤖 AI Summary
This work addresses the overreliance of egocentric action recognition on complex appearance or pose cues by proposing a lightweight, efficient motion representation. The method randomly samples 2D points in the initial video frame and tracks their trajectories with CoTracker; the image frames and the corresponding point trajectories are then fed jointly into a Transformer, forming an end-to-end recognition model. Notably, this is the first approach to use object-agnostic, randomly sampled 2D point trajectories as the core motion feature, eliminating explicit detection of hands, objects, or interaction regions and thereby substantially reducing computational overhead. Experiments show that even a single frame paired with its point trajectories yields performance gains across multiple benchmarks, validating the effectiveness and generalization of the proposed representation.
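The object-agnostic sampling step can be sketched as below. `sample_query_points` is a hypothetical helper, not from the paper; the `(t, x, y)` query layout mirrors the per-point query format CoTracker accepts (all queries anchored in frame 0), though the frame size and point count here are illustrative assumptions.

```python
import numpy as np

def sample_query_points(height, width, num_points, seed=0):
    """Object-agnostic sampling: draw (x, y) query points uniformly
    over the initial frame, with no hand/object detection involved."""
    rng = np.random.default_rng(seed)
    xs = rng.uniform(0, width, size=num_points)
    ys = rng.uniform(0, height, size=num_points)
    # One row per point: (t=0, x, y), i.e. every query starts in frame 0.
    return np.stack([np.zeros(num_points), xs, ys], axis=1)

queries = sample_query_points(height=256, width=256, num_points=64)
print(queries.shape)  # (64, 3)
```

In an actual pipeline, these queries would be handed to a point tracker such as CoTracker, which returns one (x, y) position per point per frame.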
📝 Abstract
We present a novel approach for hand-object action recognition that leverages 2D point tracks as an additional motion cue. While most existing methods rely on RGB appearance, human pose estimation, or their combination, our work demonstrates that tracking randomly sampled image points across video frames can substantially improve recognition accuracy. Unlike prior approaches, we do not detect hands, objects, or interaction regions. Instead, we employ CoTracker to follow a set of randomly initialized points through each video and use the resulting trajectories, together with the corresponding image frames, as input to a Transformer-based recognition model. Surprisingly, our method achieves notable gains even when only the initial frame and the point tracks are provided, without incorporating the full video sequence. Experimental results confirm that integrating 2D point tracks consistently enhances performance compared to the same model trained without motion information, highlighting their potential as a lightweight yet effective representation for hand-object action understanding.
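The abstract describes feeding point trajectories, alongside image frames, into a Transformer. One plausible way to turn a `(T, N, 2)` trajectory tensor into per-point tokens is to normalize coordinates by the frame size and flatten each point's path over time; this is a minimal sketch under assumed shapes, since the paper's exact tokenization is not specified here.

```python
import numpy as np

def tracks_to_tokens(tracks, height, width):
    """Turn a (T, N, 2) array of tracked (x, y) positions into N tokens,
    one per point, each holding that point's full normalized trajectory."""
    T, N, _ = tracks.shape
    # Scale pixel coordinates to [0, 1] so tokens are resolution-independent.
    norm = tracks / np.array([width, height], dtype=np.float32)
    # (T, N, 2) -> (N, T, 2) -> (N, 2T): one flat trajectory token per point.
    return norm.transpose(1, 0, 2).reshape(N, 2 * T)

tracks = np.zeros((8, 64, 2), dtype=np.float32)  # 8 frames, 64 tracked points
tokens = tracks_to_tokens(tracks, height=256, width=256)
print(tokens.shape)  # (64, 16)
```

These trajectory tokens could then be concatenated with frame-patch tokens as joint Transformer input; that fusion strategy is likewise an assumption of this sketch.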