Is Temporal Prompting All We Need For Limited Labeled Action Recognition?

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of scarce annotated data, the high computational cost of temporal modeling, and the inefficient adaptation of vision-language models in few-shot/zero-shot video action recognition, this paper proposes Temporal Prompting: a lightweight, plug-and-play method that injects learnable temporal prompts solely at the input of CLIP's image encoder, without modifying its backbone. This enables efficient sequence-level modeling of video frames. Leveraging contrastive-pretraining transfer and parameter-efficient fine-tuning, the approach drastically reduces computational overhead. Evaluated on multiple benchmarks, it surpasses state-of-the-art methods by up to 15.8% in accuracy while requiring only one-third the GFLOPs and merely 1/28 the tunable parameters, achieving a favorable trade-off between performance and deployment efficiency and establishing a new paradigm for resource-constrained video understanding.
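The mechanism described above — learnable temporal prompts prepended to each frame's token sequence at the input of a frozen image encoder — can be illustrated in miniature. The sketch below is not the paper's code: the encoder is a stand-in random projection, and all names, shapes, and the mean-pooling choice are illustrative assumptions; it only shows where the prompts enter and that the backbone stays untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8   # token embedding dimension (illustrative)
T = 4   # video frames
P = 2   # learnable temporal prompt tokens per frame

# Stand-in for a frozen CLIP image encoder: a fixed map over token
# sequences (random projection + tanh + mean pool). Never updated.
W_frozen = rng.normal(size=(D, D))

def frozen_encoder(tokens):  # tokens: (num_tokens, D)
    return np.tanh(tokens @ W_frozen).mean(axis=0)  # -> (D,)

# The ONLY trainable parameters in this scheme: one set of prompt
# tokens per frame index, injected at the encoder input.
temporal_prompts = rng.normal(size=(T, P, D)) * 0.02

def encode_video(frame_tokens):  # frame_tokens: (T, N, D)
    frame_feats = []
    for t in range(T):
        # Prepend the frame-index-specific prompts to this frame's
        # patch tokens; the encoder itself is unchanged.
        tokens = np.concatenate([temporal_prompts[t], frame_tokens[t]], axis=0)
        frame_feats.append(frozen_encoder(tokens))
    # Pool per-frame features into one video-level feature.
    return np.stack(frame_feats).mean(axis=0)

video = rng.normal(size=(T, 5, D))  # 5 patch tokens per frame
feat = encode_video(video)
print(feat.shape)  # (8,)
```

Because only `temporal_prompts` (T × P × D values here) would receive gradients, the tunable-parameter count is tiny relative to the frozen backbone, which is the source of the 1/28 parameter reduction the summary reports.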

📝 Abstract
Video understanding has shown remarkable improvements in recent years, largely dependent on the availability of large-scale labeled datasets. Recent advances in vision-language models, especially those based on contrastive pretraining, have shown remarkable generalization in zero-shot tasks, helping to overcome this dependence on labeled data. Adapting such models to video typically involves modifying the architecture of vision-language models to cater to video data. However, this is not trivial, since such adaptations are mostly computationally intensive and struggle with temporal modeling. We present TP-CLIP, an adaptation of CLIP that leverages temporal visual prompting for temporal adaptation without modifying the core CLIP architecture, preserving its generalization abilities. TP-CLIP integrates efficiently into the CLIP architecture, leveraging its pre-trained capabilities for video data. Extensive experiments across various datasets demonstrate its efficacy in zero-shot and few-shot learning, outperforming existing approaches with fewer parameters and lower computational cost. In particular, we use just 1/3 the GFLOPs and 1/28 the tunable parameters of the recent state of the art, and still outperform it by up to 15.8% depending on the task and dataset.
Problem

Research questions and friction points this paper is trying to address.

Adapting visual-language models for video temporal modeling
Reducing computational cost in video action recognition
Improving zero-shot and few-shot learning performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages temporal visual prompting
Preserves CLIP generalization abilities
Reduces computational and parameter costs