🤖 AI Summary
This paper addresses the performance degradation in downstream adaptation of vision-language models (VLMs) caused by severe label imbalance in pseudo-labeling. We systematically identify two root causes: **concept mismatch**—cross-modal semantic shift between the vision and language modalities—and **concept confusion**—ambiguous discriminability between classes. To tackle these, we propose a unified framework integrating **concept alignment and confusion-aware margin calibration**: (1) a contrastive learning–driven concept alignment module mitigates cross-modal semantic shift; (2) an adaptive margin calibration mechanism, grounded in confusion matrix estimation, dynamically refines decision boundaries for ambiguous samples; and (3) class-weighted pseudo-label reweighting is coupled with multi-paradigm collaborative training. Evaluated across six benchmark datasets and three learning paradigms, our method significantly improves pseudo-label accuracy and class balance, achieving an average 6.29% relative gain over state-of-the-art methods. Code is publicly available.
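The paper itself does not publish its update equations in this summary, but the two mechanisms described above—confusion-aware margin calibration and class-weighted pseudo-label reweighting—can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function name, the use of off-diagonal confusion mass as a per-class margin, and the inverse-frequency weights are hypothetical, not the authors' actual implementation.

```python
import numpy as np

def calibrated_pseudo_labels(logits, confusion, tau=1.0):
    """Hypothetical sketch of confusion-aware margin calibration.

    logits:    (N, C) model scores for N unlabeled samples.
    confusion: (C, C) estimated confusion matrix; entry [i, j] counts
               samples of true class i predicted as class j.
    tau:       margin temperature (illustrative hyperparameter).
    """
    # Classes that attract many wrong predictions (large off-diagonal
    # column mass) get a larger margin, making them harder to win.
    off_diag = confusion - np.diag(np.diag(confusion))
    margin = tau * off_diag.sum(axis=0) / np.maximum(confusion.sum(), 1e-8)

    # Calibrated pseudo-labels: subtract per-class margins before argmax.
    labels = np.argmax(logits - margin, axis=1)

    # Class-weighted reweighting: rarer pseudo-labels get larger weights,
    # pushing the downstream loss toward balanced predictions.
    counts = np.bincount(labels, minlength=logits.shape[1]).astype(float)
    weights = 1.0 / np.maximum(counts[labels], 1.0)
    return labels, weights
```

In this sketch, a sample whose top two calibrated scores are close can flip from an over-predicted class to an under-predicted one, which is the intended balancing effect; the inverse-frequency weights then further damp the loss contribution of majority-class pseudo-labels.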
📝 Abstract
Adapting vision-language models (VLMs) to downstream tasks with pseudo-labels has gained increasing attention. A major obstacle is that the pseudo-labels generated by VLMs tend to be imbalanced, leading to inferior performance. While existing methods have explored various strategies to address this, the underlying causes of the imbalance remain insufficiently investigated. To fill this gap, we delve into imbalanced pseudo-labels and identify two primary contributing factors: concept mismatch and concept confusion. To mitigate these two issues, we propose a novel framework incorporating concept alignment and confusion-aware calibrated margin mechanisms. The core of our approach lies in enhancing underperforming classes and promoting balanced predictions across categories, thus mitigating imbalance. Extensive experiments on six benchmark datasets with three learning paradigms demonstrate that the proposed method effectively enhances the accuracy and balance of pseudo-labels, achieving a relative improvement of 6.29% over the state-of-the-art method. Our code is available at https://anonymous.4open.science/r/CAP-C642/