🤖 AI Summary
This work addresses medical image classification when labeled data are extremely scarce and imbalanced: high annotation costs keep labeled sets small, and existing vision-language models lose accuracy on underrepresented classes. The authors propose a semi-supervised few-shot adaptation method that, for the first time, integrates text-guided pseudo-label propagation into pre-trained vision-language models. Built on a multi-modal linear probe, the approach uses unlabeled data to generate text-informed pseudo-labels during adaptation. This reduces reliance on annotated samples, mitigates the effect of class imbalance, and improves overall performance, even with less than half of the original labeled data.
📝 Abstract
Vision-language models (VLMs) pre-trained on large, heterogeneous data sources are becoming increasingly popular, providing rich multi-modal embeddings that enable efficient transfer to new tasks. A particularly relevant application is few-shot adaptation, where only a handful of annotated examples are available to adapt the model through multi-modal linear probes. In medical imaging, specialized VLMs have shown promising performance in zero- and few-shot image classification, which is valuable for mitigating the high cost of expert annotations. However, challenges remain in extremely low-shot regimes: the inherent class imbalances in medical tasks often lead to underrepresented categories, penalizing overall model performance. To address this limitation, we propose leveraging unlabeled data by introducing an efficient semi-supervised solver that propagates text-informed pseudo-labels during few-shot adaptation. The proposed method enables lower-budget annotation pipelines for adapting VLMs, reducing labeling effort by >50% in low-shot regimes.
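To make the idea of propagating text-informed pseudo-labels more concrete, the following is a minimal illustrative sketch, not the authors' solver. It assumes image and text embeddings have already been extracted from a VLM, and it swaps in a simple prototype-refinement loop: class prototypes start from the text embeddings, soft pseudo-labels are computed on unlabeled images, and the prototypes are re-estimated from the few labeled shots plus the pseudo-labeled images. The function name `text_guided_pseudo_label_probe` and the parameters `temperature`, `text_weight`, and `n_iters` are hypothetical choices made for this example only.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def text_guided_pseudo_label_probe(text_emb, x_lab, y_lab, x_unl,
                                   n_iters=10, temperature=0.01,
                                   text_weight=0.5):
    """Illustrative sketch (assumption, not the paper's algorithm).

    text_emb: (C, D) class text embeddings
    x_lab:    (N, D) labeled image embeddings, y_lab: (N,) integer labels
    x_unl:    (M, D) unlabeled image embeddings
    Returns refined class prototypes (C, D) and soft pseudo-labels (M, C).
    """
    text_emb = l2_normalize(text_emb)
    x_lab = l2_normalize(x_lab)
    x_unl = l2_normalize(x_unl)
    n_classes = text_emb.shape[0]
    one_hot = np.eye(n_classes)[y_lab]          # (N, C)

    # Start from the zero-shot (text-only) classifier.
    protos = text_emb.copy()
    for _ in range(n_iters):
        # Soft pseudo-labels for unlabeled images from the current prototypes.
        q = softmax(x_unl @ protos.T / temperature)
        # Visual prototypes: labeled shots plus pseudo-label-weighted
        # unlabeled images. A class with no shots still receives mass here.
        visual = l2_normalize(one_hot.T @ x_lab + q.T @ x_unl)
        # Keep the text embedding as an anchor so rare classes do not drift.
        protos = l2_normalize(text_weight * text_emb
                              + (1.0 - text_weight) * visual)

    q = softmax(x_unl @ protos.T / temperature)
    return protos, q
```

In this sketch the text anchor is what lets a class with zero labeled shots keep a usable prototype, which is one plausible way to read the abstract's point about underrepresented categories. The actual solver, its objective, and its hyperparameters are described in the paper and may differ substantially.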