Underrepresented in Foundation Model Pretraining Data? A One-Shot Probe

📅 2026-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of evaluating zero-shot performance of vision-language foundation models in underrepresented domains—such as those in the Global South—where labeled data are scarce. The authors propose a one-shot prediction method that requires only a single annotated image per class. By leveraging a large language model to generate counterfactual image captions and combining these with embeddings from a vision-language model to construct similarity-based features, the approach employs linear regression to accurately predict zero-shot accuracy in the target domain. This is the first method to achieve high-fidelity prediction of cross-domain zero-shot performance from just one example per class, attaining a Pearson correlation coefficient of 0.96 across five diverse datasets. The technique substantially reduces evaluation costs and offers practical support for model selection and annotation strategies in resource-constrained settings.

📝 Abstract
Large-scale Vision-Language Foundation Models (VLFMs), such as CLIP, now underpin a wide range of computer vision research and applications. VLFMs are often adapted to various domain-specific tasks. However, VLFM performance on novel, specialised, or underrepresented domains remains inconsistent. Evaluating VLFMs typically requires labelled test sets, which are often unavailable for niche domains of interest, particularly those from the Global South. We address this gap by proposing a highly data-efficient method to predict a VLFM's zero-shot accuracy on a target domain using only a single labelled image per class. Our approach uses a Large Language Model to generate plausible counterfactual descriptions of a given image. By measuring the VLFM's ability to distinguish the correct description from these hard negatives, we engineer features that capture the VLFM's discriminative power in its shared embedding space. A linear regressor trained on these similarity scores estimates the VLFM's zero-shot test accuracy across various visual domains with a Pearson-r correlation of 0.96. We demonstrate our method's performance across five diverse datasets, including standard benchmark datasets and underrepresented datasets from Africa. Our work provides a low-cost, reliable tool for probing VLFMs, enabling researchers and practitioners to make informed decisions about data annotation efforts before committing significant resources. The model training code, generated captions and counterfactuals are released here: https://github.com/chris-vorster/PreLabellingProbe.
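The core idea in the abstract — scoring how well the VLFM separates an image's true caption from LLM-generated counterfactuals, then feeding those similarity scores to a linear regressor — can be sketched as follows. This is a minimal illustration with toy embeddings; the function name, feature choices, and dimensions are assumptions, not the paper's exact implementation.

```python
import numpy as np

def probe_features(image_emb, true_cap_emb, counterfactual_embs):
    """Hypothetical feature construction: cosine similarities between an
    image embedding and its true caption vs. counterfactual captions.
    Names and feature choices are illustrative, not the paper's code."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    true_sim = cos(image_emb, true_cap_emb)
    cf_sims = [cos(image_emb, c) for c in counterfactual_embs]
    # Features: similarity to the true caption, to the hardest
    # counterfactual, and the margin between them.
    return np.array([true_sim, max(cf_sims), true_sim - max(cf_sims)])

# Toy example with random "embeddings". In the paper's setting, a
# linear regressor trained on such features (one labelled image per
# class) estimates the VLFM's zero-shot accuracy on the target domain.
rng = np.random.default_rng(0)
img = rng.normal(size=64)
true_cap = img + 0.1 * rng.normal(size=64)      # aligned with the image
cfs = [rng.normal(size=64) for _ in range(3)]   # toy hard negatives
feats = probe_features(img, true_cap, cfs)
```

A large positive margin (third feature) suggests the VLFM discriminates well in this domain; small or negative margins across the probe images would signal weak zero-shot performance.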
Problem

Research questions and friction points this paper is trying to address.

Foundation Models
Zero-shot Accuracy
Underrepresented Domains
Vision-Language Models
Data Scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

one-shot probing
vision-language foundation models
counterfactual captioning
zero-shot accuracy prediction
data-efficient evaluation
Chris Vorster
ML-Labs, Dublin City University, Dublin, Ireland
Mayug Maniparambil
IIT Madras
Computer Vision, Foundation Models, NLP, Multi Modal Models
Noel E. O'Connor
CEO, Insight Centre for Data Analytics, Dublin City University
Multimedia content analysis, information retrieval, machine learning, artificial intelligence, computer vision
Noel Murphy
ML-Labs, Dublin City University, Dublin, Ireland
Derek Molloy
ML-Labs, Dublin City University, Dublin, Ireland