Neutral-Reference Prompting for Vision-Language Models

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a common trade-off in transfer learning for vision-language models: gains on novel (unseen) classes often come at the cost of base (seen) class accuracy. The authors propose NeRP, a plug-and-play prompt correction strategy that leaves model parameters unchanged. They observe an asymmetric class confusion phenomenon in downstream tasks, where samples of one class are systematically mispredicted as another while the reverse rarely occurs, and use neutral text prompts together with reference images to estimate the pretrained model's class-wise prior bias. Combining this prior with the sample likelihood yields a surrogate score that is used to selectively flip predictions between easily confused class pairs. Across multiple backbones and 15 few-shot and cross-domain benchmarks, NeRP is reported to substantially improve novel-class accuracy while preserving base-class performance.
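To make the asymmetric confusion idea concrete, the following is a minimal sketch of how one might look for it in a set of predictions. It is not taken from the paper: the function name `asymmetric_pairs`, the thresholds `min_count` and `ratio`, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter


def asymmetric_pairs(y_true, y_pred, min_count=5, ratio=3.0):
    """List (a, b, n_ab, n_ba) where class a is mispredicted as class b
    at least `ratio` times more often than b is mispredicted as a.

    min_count and ratio are hypothetical thresholds, not values from the paper.
    """
    # Count only the off-diagonal entries of the confusion matrix.
    confusions = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    pairs = []
    for (a, b), n_ab in confusions.items():
        n_ba = confusions.get((b, a), 0)
        # max(n_ba, 1) avoids treating a zero reverse count as infinite asymmetry.
        if n_ab >= min_count and n_ab >= ratio * max(n_ba, 1):
            pairs.append((a, b, n_ab, n_ba))
    # Most frequent confusions first.
    return sorted(pairs, key=lambda row: -row[2])


# Toy labels: "cat" is predicted as "lynx" six times, the reverse happens once.
y_true = ["cat"] * 10 + ["lynx"] * 10
y_pred = ["lynx"] * 6 + ["cat"] * 4 + ["lynx"] * 9 + ["cat"]
print(asymmetric_pairs(y_true, y_pred))
```

On the toy labels the only pair reported is cat to lynx, since the reverse direction falls below both thresholds.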

📝 Abstract

Efficient transfer learning of vision-language models (VLMs) commonly suffers from a Base-New Trade-off (BNT): improving performance on unseen (new) classes often degrades accuracy on known (base) classes. Boosting recognition of unseen classes without sacrificing known-class performance remains a central challenge. Existing work often attributes the BNT simply to overfitting on known classes. We observe an interesting phenomenon: VLMs frequently exhibit asymmetric confusion on certain downstream data, i.e., samples of class A are systematically mispredicted as class B, while the reverse confusion (B to A) rarely occurs. For known classes, this kind of bias can be mitigated by tuning with a cross-entropy loss, but for unseen classes, such pretraining-induced bias persists and harms generalization. Motivated by this, we propose NeRP, a plug-and-play prompt correction strategy that improves discrimination on unseen classes without modifying model parameters. NeRP leverages neutral text prompts and reference images to measure class-wise prior preferences along the pre-trained inter-class geometry, and combines them with the sample likelihood to obtain the model's surrogate score. If, for a given sample, the prior strongly favors the current prediction while the observed evidence is clearly insufficient, we perform a local flip between easily confusable class pairs, thereby correcting prior-dominated mispredictions. Extensive experiments across multiple backbones and 15 few-shot and cross-domain benchmarks show that NeRP substantially improves accuracy on unseen classes while preserving known-class prediction performance.
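The flip condition described in the abstract can be sketched roughly as follows. This is a simplified illustration, not the paper's method: the abstract does not give the form of the surrogate score, so the sketch replaces it with two separate margin checks. The function `maybe_flip`, the margins `prior_margin` and `evidence_margin`, the `partner` mapping, and the assumption that per-class priors are already computed from neutral prompts and reference images are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def maybe_flip(logits, prior, partner, prior_margin=0.2, evidence_margin=0.1):
    """Return the final class index for one sample.

    logits  : per-class image-text scores for the sample (the evidence)
    prior   : per-class prior preference, assumed to be precomputed
    partner : dict mapping a class index to its confusable counterpart
    Both margins are hypothetical thresholds chosen for illustration.
    """
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)
    alt = partner.get(pred)
    if alt is None:
        # The predicted class has no known confusable partner: keep it.
        return pred
    prior_favors_pred = prior[pred] - prior[alt] >= prior_margin
    weak_evidence = probs[pred] - probs[alt] <= evidence_margin
    # Flip only when the prior dominates and the evidence barely separates the pair.
    return alt if (prior_favors_pred and weak_evidence) else pred


prior = [0.6, 0.3, 0.1]   # assumed: the pretrained model prefers class 0
partner = {0: 1}          # assumed: class 0 is confusable with class 1
print(maybe_flip([2.0, 1.9, 0.0], prior, partner))  # near tie, prior-dominated
print(maybe_flip([4.0, 1.0, 0.0], prior, partner))  # clear evidence for class 0
```

In the first call the two confusable classes are nearly tied while the prior leans toward the predicted class, so the prediction flips to class 1. In the second call the evidence clearly supports class 0 and the prediction is kept.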

Problem

Research questions and friction points this paper is trying to address.

Base-New Trade-off

vision-language models

unseen class recognition

pretraining bias

asymmetric confusion

Innovation

Methods, ideas, or system contributions that make the work stand out.

Neutral-Reference Prompting

Vision-Language Models

Base-New Trade-off