🤖 AI Summary
This work addresses the overreliance of egocentric action recognition on complex appearance or pose cues by proposing a lightweight, efficient motion representation. The method randomly samples 2D points in the initial video frame and tracks their trajectories with CoTracker; the image frames and the corresponding point trajectories are then fed jointly into a Transformer, forming an end-to-end recognition model. Notably, this is the first approach to use object-agnostic, randomly sampled 2D point trajectories as the core motion feature, eliminating explicit detection of hands, objects, or interaction regions and thereby substantially reducing computational overhead. Experiments show that even a single frame paired with its point trajectories yields performance gains across multiple benchmarks, validating the effectiveness and generalization of the proposed representation.
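The object-agnostic sampling step can be sketched as below. `sample_query_points` is a hypothetical helper, not from the paper; the `(t, x, y)` query layout mirrors the per-point query format CoTracker accepts (all queries anchored in frame 0), though the frame size and point count here are illustrative assumptions.

```python
import numpy as np

def sample_query_points(height, width, num_points, seed=0):
    """Object-agnostic sampling: draw (x, y) query points uniformly
    over the initial frame, with no hand/object detection involved."""
    rng = np.random.default_rng(seed)
    xs = rng.uniform(0, width, size=num_points)
    ys = rng.uniform(0, height, size=num_points)
    # One row per point: (t=0, x, y), i.e. every query starts in frame 0.
    return np.stack([np.zeros(num_points), xs, ys], axis=1)

queries = sample_query_points(height=256, width=256, num_points=64)
print(queries.shape)  # (64, 3)
```

In an actual pipeline, these queries would be handed to a point tracker such as CoTracker, which returns one (x, y) position per point per frame.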
📝 Abstract
We present a novel approach for hand-object action recognition that leverages 2D point tracks as an additional motion cue. While most existing methods rely on RGB appearance, human pose estimation, or their combination, our work demonstrates that tracking randomly sampled image points across video frames can substantially improve recognition accuracy. Unlike prior approaches, we do not detect hands, objects, or interaction regions. Instead, we employ CoTracker to follow a set of randomly initialized points through each video and use the resulting trajectories, together with the corresponding image frames, as input to a Transformer-based recognition model. Surprisingly, our method achieves notable gains even when only the initial frame and the point tracks are provided, without incorporating the full video sequence. Experimental results confirm that integrating 2D point tracks consistently enhances performance compared to the same model trained without motion information, highlighting their potential as a lightweight yet effective representation for hand-object action understanding.
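The abstract describes feeding point trajectories, alongside image frames, into a Transformer. One plausible way to turn a `(T, N, 2)` trajectory tensor into per-point tokens is to normalize coordinates by the frame size and flatten each point's path over time; this is a minimal sketch under assumed shapes, since the paper's exact tokenization is not specified here.

```python
import numpy as np

def tracks_to_tokens(tracks, height, width):
    """Turn a (T, N, 2) array of tracked (x, y) positions into N tokens,
    one per point, each holding that point's full normalized trajectory."""
    T, N, _ = tracks.shape
    # Scale pixel coordinates to [0, 1] so tokens are resolution-independent.
    norm = tracks / np.array([width, height], dtype=np.float32)
    # (T, N, 2) -> (N, T, 2) -> (N, 2T): one flat trajectory token per point.
    return norm.transpose(1, 0, 2).reshape(N, 2 * T)

tracks = np.zeros((8, 64, 2), dtype=np.float32)  # 8 frames, 64 tracked points
tokens = tracks_to_tokens(tracks, height=256, width=256)
print(tokens.shape)  # (64, 16)
```

These trajectory tokens could then be concatenated with frame-patch tokens as joint Transformer input; that fusion strategy is likewise an assumption of this sketch.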