🤖 AI Summary
This study addresses human activity recognition (e.g., walking, running, sitting, standing) from single static images—without motion cues—to support applications such as image retrieval, intelligent surveillance, and assisted living. To overcome the limited performance of conventional CNNs on this task, we systematically investigate and enhance the applicability of contrastive vision–language pre-trained models (specifically CLIP) for static action recognition. Leveraging transfer learning and fine-tuning, we perform cross-modal alignment training using the multi-label MSCOCO dataset. On a test set of 285 real-world images, our method achieves 76% accuracy—surpassing a from-scratch CNN baseline by 35 percentage points. Our key contribution is the empirical validation that vision–language pre-trained models possess strong capacity for modeling static action semantics, thereby establishing a novel paradigm for temporal-agnostic action understanding.
📝 Abstract
Recognising human activity in a single photo enables indexing, safety, and assistive applications, yet a still image lacks the motion cues that video-based methods rely on. Using 285 MSCOCO images labelled as walking, running, sitting, and standing, a CNN trained from scratch scored 41% accuracy. Fine-tuning the multimodal CLIP model raised this to 76%, demonstrating that contrastive vision–language pre-training decisively improves still-image action recognition in real-world deployments.
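At inference time, a CLIP-style classifier works by comparing an image embedding against text embeddings of action prompts and picking the closest one. The sketch below illustrates that mechanism with toy NumPy vectors standing in for real CLIP encoder outputs (the embeddings, the 512 dimension, and the logit scale of 100 are illustrative assumptions, not values from the study):

```python
import numpy as np

ACTIONS = ["walking", "running", "sitting", "standing"]

def classify(image_emb, text_embs):
    # L2-normalise both sides, as CLIP does before the dot product
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = 100.0 * (txt @ img)          # CLIP's learned logit scale is ~100
    probs = np.exp(logits - logits.max()) # numerically stable softmax
    probs /= probs.sum()
    return ACTIONS[int(np.argmax(probs))], probs

# Toy embeddings standing in for real CLIP encoder outputs.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(4, 512))                    # one vector per action prompt
image_emb = text_embs[2] + 0.1 * rng.normal(size=512)    # image near "sitting"

label, probs = classify(image_emb, text_embs)
print(label)  # → sitting
```

In the study's fine-tuned setting, the same similarity computation applies; fine-tuning simply adjusts the two encoders so that matching image-text pairs from MSCOCO land closer together in the shared embedding space.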