CLIP-Guided Adaptable Self-Supervised Learning for Human-Centric Visual Tasks

📅 2026-01-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing unsupervised pre-training methods struggle to support diverse human-centric vision tasks effectively. To address this limitation, this work proposes CLASP, a framework that leverages CLIP to generate multi-granularity semantic pseudo-labels and introduces a prompt-controlled mixture-of-experts (MoE) architecture for semantics-aware, task-adaptive unsupervised representation learning. By combining CLIP-guided pseudo-labeling, multi-task self-supervised learning, and dynamic expert routing, CLASP substantially outperforms current unsupervised pre-training approaches across multiple benchmarks, improving both transfer performance and generalization on human-centric vision tasks.
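As a rough illustration of the CLIP-guided pseudo-labeling step, the sketch below scores unlabeled person images against part- and attribute-level text vocabularies with the OpenAI `clip` package. The vocabularies, prompt templates, and confidence threshold are illustrative assumptions, not values from the paper; in practice, part-level labels would be matched per region or patch rather than per whole image.

```python
# Minimal sketch of CLIP-guided pseudo-labeling. Label vocabularies,
# prompt templates, and the threshold are assumptions for illustration.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical vocabularies for the two semantic granularities.
PART_LABELS = ["head", "torso", "left arm", "right arm", "left leg", "right leg"]
ATTRIBUTE_LABELS = ["wearing a hat", "carrying a backpack", "wearing a dress"]

@torch.no_grad()
def clip_pseudo_labels(image_path, labels, template):
    """Score an unlabeled image against a label vocabulary with CLIP."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    texts = clip.tokenize([template.format(l) for l in labels]).to(device)
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(texts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sims = (img_feat @ txt_feat.T).squeeze(0)  # cosine similarities
    return {l: s.item() for l, s in zip(labels, sims)}

# Attribute-level pseudo-labels: keep labels above a confidence threshold.
scores = clip_pseudo_labels("person.jpg", ATTRIBUTE_LABELS,
                            template="a photo of a person {}")
pseudo_attrs = [l for l, s in scores.items() if s > 0.25]  # threshold is a guess

# Part-level scoring would use the same call on per-region crops, e.g.:
# clip_pseudo_labels("crop.jpg", PART_LABELS, template="a photo of a person's {}")
```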

📝 Abstract
Human-centric visual analysis plays a pivotal role in diverse applications, including surveillance, healthcare, and human-computer interaction. With the emergence of large-scale unlabeled human image datasets, there is an increasing need for a general unsupervised pre-training model capable of supporting diverse human-centric downstream tasks. To achieve this goal, we propose CLASP (CLIP-guided Adaptable Self-suPervised learning), a novel framework designed for unsupervised pre-training in human-centric visual tasks. CLASP leverages the powerful vision-language model CLIP to generate both low-level (e.g., body parts) and high-level (e.g., attributes) semantic pseudo-labels. These multi-level semantic cues are then integrated into the learned visual representations, enriching their expressiveness and generalizability. Recognizing that different downstream tasks demand varying levels of semantic granularity, CLASP incorporates a Prompt-Controlled Mixture-of-Experts (MoE) module. This module dynamically adapts feature extraction based on task-specific prompts, mitigating potential feature conflicts and enhancing transferability. Furthermore, CLASP employs a multi-task pre-training strategy, where part- and attribute-level pseudo-labels derived from CLIP guide the representation learning process. Extensive experiments across multiple benchmarks demonstrate that CLASP consistently outperforms existing unsupervised pre-training methods, advancing the field of human-centric visual analysis.
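One way to picture the Prompt-Controlled MoE is as a gating network whose routing weights are conditioned on a learned per-task prompt embedding. The PyTorch sketch below is a minimal rendering under assumed dimensions, a soft (dense) routing scheme, and two-layer MLP experts; the paper's actual expert and router designs may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptControlledMoE(nn.Module):
    """Minimal sketch: all experts see the same input feature, and a
    learned task prompt controls the routing weights. Sizes and the
    soft-routing choice are assumptions, not the paper's specification."""

    def __init__(self, dim=768, num_experts=4, num_tasks=3):
        super().__init__()
        # One learnable prompt embedding per downstream task.
        self.task_prompts = nn.Embedding(num_tasks, dim)
        # Router mixes visual features with the task prompt to score experts.
        self.router = nn.Linear(2 * dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )

    def forward(self, feats, task_id):
        # feats: (B, dim) pooled visual features; task_id: (B,) task indices.
        prompt = self.task_prompts(task_id)                            # (B, dim)
        gate = F.softmax(self.router(torch.cat([feats, prompt], -1)), -1)
        expert_out = torch.stack([e(feats) for e in self.experts], 1)  # (B, E, dim)
        return (gate.unsqueeze(-1) * expert_out).sum(1)                # weighted mix

# Usage: the same backbone feature is routed differently per task.
moe = PromptControlledMoE()
x = torch.randn(2, 768)
y = moe(x, torch.tensor([0, 2]))  # two samples, two different tasks
```

Conditioning the gate on the task prompt is what lets identical backbone features be recombined differently per downstream task, which is the mechanism the abstract credits with mitigating feature conflicts.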
Problem

Research questions and friction points this paper is trying to address.

human-centric visual tasks
unsupervised pre-training
visual representation learning
semantic granularity
downstream task transferability
Innovation

Methods, ideas, or system contributions that make the work stand out.

CLIP-guided pseudo-labeling
Prompt-Controlled Mixture-of-Experts
Multi-level semantic representation
Human-centric self-supervised learning
Adaptable feature extraction
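Tying these pieces together, the multi-task pre-training objective described in the abstract can be sketched as two heads supervised by the CLIP-derived pseudo-labels: a dense head for part labels and a global head for attributes, combined as a weighted sum. The head architectures, loss choices, and weight below are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskPretrainHead(nn.Module):
    """Sketch of joint pre-training on CLIP pseudo-labels. The 1x1-conv
    part head, BCE attribute head, and loss weight are illustrative."""

    def __init__(self, dim=768, num_parts=6, num_attrs=3, part_weight=1.0):
        super().__init__()
        self.part_head = nn.Conv2d(dim, num_parts, kernel_size=1)  # dense parts
        self.attr_head = nn.Linear(dim, num_attrs)                 # global attrs
        self.part_weight = part_weight

    def forward(self, feat_map, part_labels, attr_labels):
        # feat_map:    (B, dim, H, W) backbone feature map.
        # part_labels: (B, H, W) per-location part indices from CLIP matching.
        # attr_labels: (B, num_attrs) multi-hot attributes from CLIP scoring.
        part_logits = self.part_head(feat_map)
        attr_logits = self.attr_head(feat_map.mean(dim=(2, 3)))
        loss_part = F.cross_entropy(part_logits, part_labels)
        loss_attr = F.binary_cross_entropy_with_logits(attr_logits, attr_labels)
        return self.part_weight * loss_part + loss_attr
```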