🤖 AI Summary
This work addresses the inherent permutation invariance of image-pretrained models (e.g., DINOv2, CLIP) when applied to video temporal modeling—specifically, their inability to distinguish nearly symmetric actions (e.g., opening vs. closing a bottle). To tackle this, we propose STEP, a parameter-efficient probing method that explicitly encodes temporal order during image-to-video transfer. STEP introduces learnable frame-level positional embeddings, a single global [CLS] token, and a simplified self-attention mechanism, while leaving the image backbone frozen; only the lightweight probe is trained. Evaluated on four action recognition benchmarks, STEP achieves accuracy gains of 3-15% over existing probing mechanisms while using only one-third of their learnable parameters, and on two datasets it surpasses all published methods, including fully fine-tuned models. Crucially, for nearly symmetric actions, STEP improves accuracy by 9-19% over prior probes, substantially mitigating action-direction ambiguity.
📝 Abstract
We study parameter-efficient image-to-video probing for the unaddressed challenge of recognizing nearly symmetric actions: visually similar actions that unfold in opposite temporal order (e.g., opening vs. closing a bottle). Existing probing mechanisms for image-pretrained models, such as DINOv2 and CLIP, rely on attention for temporal modeling but are inherently permutation-invariant, producing identical predictions regardless of frame order. To address this, we introduce Self-attentive Temporal Embedding Probing (STEP), a simple yet effective approach designed to enforce temporal sensitivity in parameter-efficient image-to-video transfer. STEP enhances self-attentive probing with three key modifications: (1) a learnable frame-wise positional encoding that explicitly encodes temporal order; (2) a single global CLS token for sequence coherence; and (3) a simplified attention mechanism for improved parameter efficiency. STEP outperforms existing image-to-video probing mechanisms by 3-15% across four activity recognition benchmarks with only 1/3 of the learnable parameters. On two datasets, it surpasses all published methods, including fully fine-tuned models. STEP shows a distinct advantage in recognizing nearly symmetric actions, surpassing other probing mechanisms by 9-19% and parameter-heavier PEFT-based transfer methods by 5-15%. Code and models will be made publicly available.
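To make the three modifications concrete, here is a minimal NumPy sketch of a STEP-style probe. It is an illustration under stated assumptions, not the paper's implementation: the exact form of the "simplified" attention is not specified in the abstract, so this sketch assumes a single head with a shared query/key projection and identity values; the initialization scale, feature dimension, and class names are likewise hypothetical. The key point it demonstrates is that adding frame-wise positional embeddings before self-attention breaks permutation invariance, so reversed frame order yields a different clip representation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class STEPProbe:
    """Hypothetical sketch of Self-attentive Temporal Embedding Probing.

    Assumption (not from the paper): the "simplified" attention is a single
    head with one shared query/key projection and no value projection.
    """

    def __init__(self, num_frames, dim):
        # (1) learnable per-frame positional embedding: encodes temporal order
        self.pos = 0.02 * rng.standard_normal((num_frames, dim))
        # (2) a single global CLS token summarizing the whole clip
        self.cls = 0.02 * rng.standard_normal((1, dim))
        # (3) simplified attention: one shared projection for queries and keys
        self.w_qk = 0.02 * rng.standard_normal((dim, dim))

    def __call__(self, frame_feats):
        # frame_feats: (T, D) frozen image-backbone features, one per frame
        x = frame_feats + self.pos             # inject temporal order
        x = np.concatenate([self.cls, x], 0)   # prepend CLS -> (T+1, D)
        q = x @ self.w_qk
        attn = softmax(q @ q.T / np.sqrt(x.shape[1]))
        out = attn @ x                         # values are the inputs themselves
        return out[0]                          # CLS output = clip representation
```

A quick check of the claimed property: a plain average of frame features is identical for a clip and its temporal reverse, while this probe separates the two.

```python
probe = STEPProbe(num_frames=8, dim=16)
feats = np.random.default_rng(1).standard_normal((8, 16))

forward = probe(feats)
backward = probe(feats[::-1])          # same frames, opposite temporal order
print(np.allclose(feats.mean(0), feats[::-1].mean(0)))  # mean pooling: order-blind
print(np.allclose(forward, backward))                   # STEP-style probe: order-aware
```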