AI Summary
Parameter-efficient fine-tuning (PEFT) of vision-language models (VLMs) suffers from spurious correlations and degraded group robustness when the fine-tuning data is imbalanced across subgroups. Method: The paper proposes Group Context Optimization (GroupCoOp), a debiased PEFT algorithm that introduces group-specific textual prompts acting as multiple lightweight classifiers for each class. By leveraging the strong semantic generalization of the text encoder, GroupCoOp discovers effective prompts even for underrepresented groups and tightens the class embedding distributions. The approach fine-tunes only 0.016% of the model's parameters. Contribution/Results: GroupCoOp achieves significant improvements in group robustness on five benchmarks across five CLIP architectures, occasionally outperforming methods that fine-tune the entire network, and balances parameter efficiency, robustness, and fairness without architectural modification.
Abstract
Parameter-efficient fine-tuning (PEFT) of vision-language models (VLMs) excels in various vision tasks thanks to the rich knowledge and generalization ability of VLMs. However, recent studies have revealed that such fine-tuned VLMs are vulnerable to spurious correlations stemming from subgroup imbalance in the fine-tuning datasets. To resolve this issue, we propose Group Context Optimization (GroupCoOp), a simple and effective debiased fine-tuning algorithm that enhances the group robustness of fine-tuned VLMs. Its key idea is to employ group-specific text prompts as group representatives, which serve as multiple classifiers for their target class. The rich semantic knowledge of the VLM's text encoder enables the discovery of effective group prompts even for groups with few training samples. Leveraging the group prompts for each class addresses the issues caused by a group-imbalanced training set, such as the neglect of minority groups and the scattered distribution of each class in the embedding space. GroupCoOp achieved the best results on five benchmarks across five CLIP architectures and occasionally outperformed prior methods that fine-tune the entire network, despite training only 0.016% of the network's parameters.
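The core mechanism (several group-specific prompt embeddings per class, with the class score taken over all of that class's prompts) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the toy 2-D embeddings, the function name, and the max-pooling reduction over group scores are all assumptions made for clarity; in the actual method the prompts are learned context vectors fed through a CLIP text encoder.

```python
import math


def group_prompt_logits(image_emb, group_prompt_embs, groups_per_class):
    """Classify an image embedding against group-specific prompt embeddings.

    Each class owns `groups_per_class` consecutive prompts in
    `group_prompt_embs` (one per subgroup). The class logit is the maximum
    cosine similarity over that class's group prompts, so a minority-group
    prompt can still carry the prediction for its class.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    # Similarity of the image to every group prompt.
    sims = [cosine(image_emb, p) for p in group_prompt_embs]
    # Max-pool group-level scores into one logit per class.
    return [max(sims[i:i + groups_per_class])
            for i in range(0, len(sims), groups_per_class)]


# Toy example: 2 classes x 2 groups, 2-D embeddings (hypothetical values).
prompts = [
    [1.0, 0.0], [0.9, 0.1],   # class 0: group prompts
    [0.0, 1.0], [0.1, 0.9],   # class 1: group prompts
]
logits = group_prompt_logits([0.05, 1.0], prompts, groups_per_class=2)
predicted_class = max(range(len(logits)), key=lambda c: logits[c])
```

Here the image embedding lies near class 1's prompts, so class 1 wins even though only one of its group prompts matches closely, which is the intuition behind using multiple group prompts as classifiers for a single class.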