🤖 AI Summary
This study addresses human activity recognition (e.g., walking, running, sitting, standing) from single static images—without motion cues—to support applications such as image retrieval, intelligent surveillance, and assisted living. To overcome the limited performance of conventional CNNs on this task, we systematically investigate and enhance the applicability of contrastive vision–language pre-trained models (specifically CLIP) for static action recognition. Leveraging transfer learning and fine-tuning, we perform cross-modal alignment training using the multi-label MSCOCO dataset. On a test set of 285 real-world images, our method achieves 76% accuracy—surpassing a from-scratch CNN baseline by 35 percentage points. Our key contribution is the empirical validation that vision–language pre-trained models possess strong capacity for modeling static action semantics, thereby establishing a novel paradigm for temporal-agnostic action understanding.
📝 Abstract
Recognising human activity in a single photo enables indexing, safety, and assistive applications, yet a still image lacks the motion cues that video-based methods rely on. Using 285 MSCOCO images labelled as walking, running, sitting, and standing, a CNN trained from scratch scored 41% accuracy. Fine-tuning the multimodal CLIP model raised this to 76%, demonstrating that contrastive vision–language pre-training decisively improves still-image action recognition in real-world deployments.
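At inference time, a CLIP-style classifier works by comparing an image embedding against text embeddings of action prompts and picking the closest one. The sketch below illustrates that mechanism with toy NumPy vectors standing in for real CLIP encoder outputs (the embeddings, the 512 dimension, and the logit scale of 100 are illustrative assumptions, not values from the study):

```python
import numpy as np

ACTIONS = ["walking", "running", "sitting", "standing"]

def classify(image_emb, text_embs):
    # L2-normalise both sides, as CLIP does before the dot product
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = 100.0 * (txt @ img)          # CLIP's learned logit scale is ~100
    probs = np.exp(logits - logits.max()) # numerically stable softmax
    probs /= probs.sum()
    return ACTIONS[int(np.argmax(probs))], probs

# Toy embeddings standing in for real CLIP encoder outputs.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(4, 512))                    # one vector per action prompt
image_emb = text_embs[2] + 0.1 * rng.normal(size=512)    # image near "sitting"

label, probs = classify(image_emb, text_embs)
print(label)  # → sitting
```

In the study's fine-tuned setting, the same similarity computation applies; fine-tuning simply adjusts the two encoders so that matching image-text pairs from MSCOCO land closer together in the shared embedding space.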