Prompt-Induced Score Variance in Zero-Shot Binary Vision-Language Safety Classification

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the instability of zero-shot vision-language safety classification, where the first-token probability obtained from a single prompt is highly sensitive to semantically equivalent prompt rewrites, leading to unreliable decisions and elevated error rates. The study shows that rephrasing alone can materially change the unsafe probability assigned to the same sample, and that cross-prompt variance is associated with prompt-level disagreement and higher error. To mitigate this, the authors propose a training-free baseline that averages predictions across a family of prompts (a mean ensemble), and show that labeled calibration (temperature scaling, Platt scaling, or isotonic regression) applied on top of the mean yields further gains when labels are available. Across 14 dataset-model pairs, the mean ensemble improves negative log-likelihood (NLL) in all 14 and expected calibration error (ECE) in 12, relative to a train-selected single-prompt baseline. It also wins more head-to-head NLL comparisons than the three labeled calibration methods applied to the same prompt, and improves AUROC and AUPRC over the train-selected baseline.
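The aggregation and the variance diagnostic described above can be illustrated with a minimal sketch. It assumes the per-prompt unsafe probabilities are already collected in an array of shape (samples, prompts) and that the mean is taken over probabilities rather than logits; the summary does not say which the authors use. The function names and the toy numbers are hypothetical.

```python
import numpy as np

def mean_ensemble(prompt_probs):
    """Average the unsafe probability over the prompt family (one score per sample)."""
    p = np.asarray(prompt_probs, dtype=float)
    return p.mean(axis=1)

def cross_prompt_variance(prompt_probs):
    """Variance of the unsafe probability across prompts, per sample."""
    p = np.asarray(prompt_probs, dtype=float)
    return p.var(axis=1)

def prompt_disagreement(prompt_probs, threshold=0.5):
    """Fraction of prompts whose thresholded decision is in the minority, per sample."""
    votes = np.asarray(prompt_probs, dtype=float) >= threshold
    frac_unsafe = votes.mean(axis=1)
    return np.minimum(frac_unsafe, 1.0 - frac_unsafe)

# Toy data (assumed, not from the paper): 3 samples scored under 3 equivalent prompts.
probs = np.array([
    [0.90, 0.80, 0.85],  # prompts agree: unsafe
    [0.20, 0.70, 0.60],  # prompts disagree
    [0.10, 0.05, 0.15],  # prompts agree: safe
])

ensemble = mean_ensemble(probs)
variance = cross_prompt_variance(probs)
disagreement = prompt_disagreement(probs)
```

In this toy case the second sample has both the largest cross-prompt variance and the only non-zero disagreement, which is the kind of sample the fragility diagnostic is meant to flag.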

📝 Abstract

Single-prompt first-token probabilities from zero-shot vision-language model (VLM) safety classifiers are treated as decision scores, but we show they are unreliable under semantically equivalent prompt reformulation: even when the binary label is constrained to a fixed output position, equivalent prompts can induce materially different unsafe probabilities for the same sample. Across multimodal safety benchmarks and multiple VLM families, cross-prompt variance is strongly associated with prompt-level disagreement and higher error, making it a useful fragility diagnostic. A training-free mean ensemble improves NLL on all 14 dataset-model evaluation pairs and ECE on 12/14 relative to a train-selected single-prompt baseline, and wins more head-to-head NLL comparisons than labeled temperature scaling, Platt scaling, and isotonic regression applied to the same prompt. Ranking gains are consistent against the train-selected baseline on both AUROC and AUPRC, and against the full 15-prompt distribution remain consistent on AUPRC while softening on AUROC. Labeled calibration on top of the mean provides further gains when labels are available, identifying prompt averaging as a strong label-free first stage rather than a replacement for calibration. We frame this as a reliability stress test for zero-shot VLM first-token safety scores and recommend prompt-family evaluation with mean aggregation as a standard label-free reliability baseline.
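For readers unfamiliar with the two reliability metrics named in the abstract, the following sketch shows one common way to compute binary NLL and a binned ECE from unsafe probabilities and 0/1 labels. The equal-width binning over the predicted unsafe probability and the default of 10 bins are assumptions; the abstract does not specify the ECE variant used in the paper.

```python
import numpy as np

def binary_nll(p_unsafe, labels, eps=1e-12):
    """Mean negative log-likelihood of 0/1 labels under predicted unsafe probabilities."""
    p = np.clip(np.asarray(p_unsafe, dtype=float), eps, 1.0 - eps)
    y = np.asarray(labels, dtype=float)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def binary_ece(p_unsafe, labels, n_bins=10):
    """Binned ECE: weighted gap between observed unsafe rate and mean predicted probability."""
    p = np.asarray(p_unsafe, dtype=float)
    y = np.asarray(labels, dtype=float)
    # Assign each sample to an equal-width bin; p == 1.0 goes into the last bin.
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return float(ece)

# Toy comparison (assumed numbers): a single prompt versus a mean over prompts.
labels = np.array([1, 1, 0, 0])
single_prompt = np.array([0.99, 0.30, 0.60, 0.01])
prompt_mean = np.array([0.90, 0.60, 0.40, 0.10])

nll_single = binary_nll(single_prompt, labels)
nll_mean = binary_nll(prompt_mean, labels)
```

Lower is better for both metrics. Comparing `binary_nll` and `binary_ece` for a single-prompt score against the prompt-averaged score on the same labeled samples mirrors, in miniature, the head-to-head comparisons the abstract reports.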

Problem

Research questions and friction points this paper is trying to address.

zero-shot

vision-language model

safety classification

prompt variance

score reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt variance

zero-shot VLM

safety classification