🤖 AI Summary
Existing vision-language models (VLMs) struggle with daily-activity video understanding due to visually similar appearances, subtle motion patterns, viewpoint variations, and the absence of language-guided supervision, leading to poor zero-shot generalization. To address this, we propose SKI models, a co-training framework that explicitly integrates 3D skeletal representations into the vision-language embedding space. SKI models use a skeleton-language model, SkeletonCLIP, to infuse skeleton information into VLMs and large vision-language models (LVLMs) via skeleton–image–text multimodal contrastive learning. Crucially, skeleton supervision is used only during training; no skeleton input is required at inference. By unifying 3D pose modeling with VLMs and LVLMs, SKI models jointly target zero-shot action recognition and video caption generation. Evaluated on three benchmark ADL datasets, they achieve substantial gains: +12.7% average accuracy in zero-shot action recognition and +4.3 BLEU-4 points in video description generation. SKI models thus offer a practical paradigm that leverages skeleton priors during training while remaining skeleton-free at inference.
📝 Abstract
The introduction of vision-language models (VLMs) like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. However, these approaches are not integrated with language, limiting their ability to generalize to unseen action classes. In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton-language model, SkeletonCLIP, to infuse skeleton information into VLMs and Large Vision Language Models (LVLMs) through collaborative training. Notably, SKI models do not require skeleton data during inference, enhancing their robustness for real-world applications. The effectiveness of SKI models is validated on three popular ADL datasets for zero-shot action recognition and video caption generation tasks.
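To make the skeleton–image–text contrastive training concrete, below is a minimal sketch of a CLIP-style symmetric InfoNCE objective applied pairwise across the three modalities. This is an illustrative assumption, not the paper's actual loss: the function names (`info_nce`, `tri_modal_loss`), the equal weighting of the three pairwise terms, and the temperature value are all hypothetical, and the embeddings here stand in for the outputs of the skeleton, video, and text encoders.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings.

    Matching pairs sit on the diagonal of the similarity matrix;
    the loss pulls them together and pushes mismatched pairs apart.
    """
    # L2-normalize so the dot product is a cosine similarity
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature

    def xent_diag(l):
        # cross-entropy with the correct class on the diagonal
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average both retrieval directions (a->b and b->a)
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

def tri_modal_loss(skel, img, txt):
    # hypothetical combination: sum of the three pairwise contrastive terms;
    # at inference only the image/video and text branches would be used
    return info_nce(skel, txt) + info_nce(img, txt) + info_nce(skel, img)
```

Under this sketch, aligned triplets (skeleton, video, and caption embeddings of the same clip sharing a batch index) yield a lower loss than shuffled ones, which is the signal that pulls skeleton structure into the shared embedding space during training.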