Will It Zero-Shot?: Predicting Zero-Shot Classification Performance For Arbitrary Queries

📅 2026-01-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge non-expert users face in anticipating how well vision-language models (VLMs) will perform on zero-shot classification tasks. The authors propose an annotation-free performance prediction method that augments textual embedding similarity with task-relevant synthetic images generated from the class prompts. Scores from these images are fused with the text-only scores, yielding the first image-augmented approach to zero-shot performance prediction. Built on the CLIP framework and integrating text-to-image generation, the method significantly outperforms text-only baselines on standard benchmarks. It improves prediction accuracy and also provides interpretable visual feedback, helping users assess whether a VLM is suited to a given task.
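
The pipeline described above can be illustrated with a short sketch: generate a few synthetic images per class name, classify them zero-shot with CLIP, and treat accuracy on the synthetic set as a proxy for real-world zero-shot accuracy. The model choices, prompt template, and the use of synthetic-set accuracy as the image-based score are illustrative assumptions here, not the paper's exact pipeline.

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical model choices; any text-to-image generator and CLIP variant
# could stand in for these.
generator = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5").to(device)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def predicted_zero_shot_accuracy(class_names, images_per_class=4):
    """Generate synthetic images for each class, classify them zero-shot
    with CLIP, and return accuracy on the synthetic set as a proxy for
    real zero-shot performance."""
    images, labels = [], []
    for idx, name in enumerate(class_names):
        for _ in range(images_per_class):
            images.append(generator(f"a photo of a {name}").images[0])
            labels.append(idx)
    prompts = [f"a photo of a {c}" for c in class_names]
    inputs = processor(text=prompts, images=images,
                       return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        logits = clip(**inputs).logits_per_image  # (n_images, n_classes)
    preds = logits.argmax(dim=-1).cpu()
    return (preds == torch.tensor(labels)).float().mean().item()
```

In the paper's framing this image-based signal is combined with the text-only baseline rather than used alone; a weighted average of the two scores would be one plausible fusion, though the actual strategy is described in the paper.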

📝 Abstract
Vision-Language Models like CLIP create aligned embedding spaces for text and images, making it possible for anyone to build a visual classifier by simply naming the classes they want to distinguish. However, a model that works well in one domain may fail in another, and non-expert users have no straightforward way to assess whether their chosen VLM will work on their problem. We build on prior work that uses text-only comparisons to evaluate how well a model works for a given natural language task, and explore approaches that also generate synthetic images relevant to that task to evaluate and refine the prediction of zero-shot accuracy. We show that adding generated imagery to the baseline text-only scores substantially improves the quality of these predictions, and it gives the user feedback on the kinds of images that were used to make the assessment. Experiments on standard CLIP benchmark datasets demonstrate that the image-based approach helps users predict, without any labeled examples, whether a VLM will be effective for their application.
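
As a rough illustration of the text-only baseline the abstract builds on, one can score a proposed class set by how spread out the class-name embeddings are in CLIP's text space: classes whose prompts are far apart should be easier to separate zero-shot. The scoring function below is an assumption for illustration, not the prior work's exact formula.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def text_separability(class_names):
    """Lower mean inter-class cosine similarity means the class prompts sit
    farther apart in CLIP's text space, suggesting an easier zero-shot task."""
    prompts = [f"a photo of a {c}" for c in class_names]
    inputs = tokenizer(prompts, padding=True, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    sim = emb @ emb.T                                   # pairwise cosine sims
    off_diag = sim[~torch.eye(len(class_names), dtype=torch.bool)]
    return 1.0 - off_diag.mean().item()                 # higher = easier task

# Fine-grained class sets should score lower than visually distinct ones.
print(text_separability(["golden retriever", "tabby cat", "sports car"]))
```
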
Problem

Research questions and friction points this paper is trying to address.

zero-shot classification
vision-language models
performance prediction
CLIP
synthetic images
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot classification
vision-language models
synthetic image generation
performance prediction
CLIP
🔎 Similar Papers
No similar papers found.
👥 Authors
Kevin Robbins
Computer Science, George Washington University, Washington, DC, USA
Xiaotong Liu
Computer Science, George Washington University, Washington, DC, USA
Yu Wu
University of Cambridge
machine learning, health sensing, mobile health
Le Sun
Institute of Software, CAS
information retrieval, natural language processing
Grady McPeak
Computer Science, George Washington University, Washington, DC, USA
Abby Stylianou
Associate Professor, Saint Louis University
Computer Vision, Machine Learning
Robert Pless
George Washington University
computer vision, machine learning, visualization, explainable AI, citizen science