🤖 AI Summary
To address the open problems of trajectory point selection and motion modeling in few-shot action recognition, this paper proposes a semantic-aware trajectory modeling framework. Methodologically, it integrates point tracking, saliency estimation, Histogram of Oriented Displacements (HoD) encoding, and a relational token network. Its key contributions are: (1) a semantic-saliency-guided point sampling strategy that prioritizes informative and discriminative tracked points; and (2) joint modeling of intra-trajectory motion features (via HoD) and inter-trajectory relationships using learned relational tokens, which fuses motion with appearance features. Evaluated on six few-shot action recognition benchmarks, including Something-Something-V2, Kinetics, and UCF101, the method achieves state-of-the-art performance, which the authors attribute to improved few-shot generalization and fine-grained action discrimination.
📝 Abstract
Video understanding requires effective modeling of both motion and appearance information, particularly for few-shot action recognition. While recent advances in point tracking have been shown to improve few-shot action recognition, two fundamental challenges persist: selecting informative points to track and effectively modeling their motion patterns. We present Trokens, a novel approach that transforms trajectory points into semantic-aware relational tokens for action recognition. First, we introduce a semantic-aware sampling strategy to adaptively distribute tracking points based on object scale and semantic relevance. Second, we develop a motion modeling framework that captures both intra-trajectory dynamics, through the Histogram of Oriented Displacements (HoD), and inter-trajectory relationships to model complex action patterns. Our approach combines these trajectory tokens with semantic features to enhance appearance features with motion information, achieving state-of-the-art performance across six diverse few-shot action recognition benchmarks: Something-Something-V2 (both full and small splits), Kinetics, UCF101, HMDB51, and FineGym. Project page: https://trokens-iccv25.github.io
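To make the intra-trajectory descriptor more concrete, the following is a minimal sketch of a histogram of oriented displacements for a single 2D point trajectory. It is an illustration only, not the Trokens implementation: the function name, the choice of 8 orientation bins, the magnitude weighting, and the L1 normalization are all assumptions, and the paper's actual HoD formulation may differ.

```python
import numpy as np


def histogram_of_oriented_displacements(traj, num_bins=8, eps=1e-8):
    """Illustrative HoD for one trajectory (hypothetical helper, not the paper's code).

    traj: array-like of shape (T, 2) holding (x, y) positions over T frames.
    Returns an array of shape (num_bins,) that sums to ~1, or all zeros
    if the point never moves.
    """
    traj = np.asarray(traj, dtype=np.float64)

    # Frame-to-frame displacement vectors, shape (T-1, 2).
    disp = np.diff(traj, axis=0)

    # Magnitude and orientation of each displacement.
    mag = np.linalg.norm(disp, axis=1)
    ang = np.arctan2(disp[:, 1], disp[:, 0])  # in [-pi, pi]

    # Map angles to [0, 2*pi) and quantize into num_bins orientation bins.
    bin_width = 2.0 * np.pi / num_bins
    bins = np.floor((ang % (2.0 * np.pi)) / bin_width).astype(int) % num_bins

    # Assumed design choice: each displacement votes with its magnitude.
    hist = np.bincount(bins, weights=mag, minlength=num_bins)

    # L1-normalize so that trajectories of different lengths are comparable.
    return hist / (hist.sum() + eps)


# Usage: a point drifting mostly to the right concentrates mass in bin 0.
rightward = [[0.0, 0.0], [1.0, 0.1], [2.0, 0.2]]
print(histogram_of_oriented_displacements(rightward))
```

Under these assumptions, each tracked point yields one fixed-length vector regardless of how far it moves, which could then be projected into a token and combined with appearance features.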