🤖 AI Summary
This study investigates the robustness of vision-language models (VLMs) under uncertain and ambiguous inputs, revealing that while scaling improves robustness, strong instruction-following tendencies still induce hallucinated, confident responses. To address this, we propose an unsupervised uncertainty quantification method based on generative caption diversity: it requires no annotations yet reliably estimates model confidence. We further introduce *abstention prompting*, which achieves near-perfect robustness in several settings on natural-image benchmarks such as ImageNet. Experiments also uncover domain-specific knowledge gaps (e.g., in galaxy morphology classification) as critical reliability bottlenecks. Our diversity-based uncertainty metric significantly outperforms conventional confidence baselines across multiple uncertainty-aware tasks, including anomaly detection and ambiguous classification, demonstrating both efficacy and broad applicability.
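The summary's *abstention prompting* amounts to instructing the model that declining to answer is a valid output. The exact prompt wording is not given here, so the following is a minimal illustrative sketch of the pattern; the `build_abstention_prompt` helper and the `'unsure'` token are hypothetical choices, not the paper's actual template.

```python
def build_abstention_prompt(class_names):
    """Build a classification prompt that explicitly permits abstention,
    rather than forcing the VLM to emit a confident label.

    Illustrative only: the paper's actual prompt wording may differ.
    """
    options = ", ".join(class_names)
    return (
        f"Classify the image as one of: {options}.\n"
        "If the image is ambiguous, anomalous, or does not clearly match "
        "any option, answer exactly 'unsure' instead of guessing."
    )

# Example: an abstention-aware prompt for a 3-way natural-image task.
prompt = build_abstention_prompt(["dog", "cat", "bird"])
print(prompt)
```

The key design point is that abstention is offered as an explicit, parseable option, so downstream code can treat `'unsure'` as a rejection rather than a misclassification.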
📝 Abstract
Robustness against uncertain and ambiguous inputs is a critical challenge for deep learning models. While recent advancements in large-scale vision-language models (VLMs, e.g., GPT-4o) might suggest that increasing model and training dataset size would mitigate this issue, our empirical evaluation shows a more complicated picture. Testing models on two classic uncertainty quantification tasks, anomaly detection and classification under inherently ambiguous conditions, we find that newer and larger VLMs indeed exhibit improved robustness compared to earlier models, but still suffer from a tendency to strictly follow instructions, which often causes them to hallucinate confident responses even when faced with unclear or anomalous inputs. Remarkably, for natural images such as ImageNet, this limitation can be overcome without pipeline modifications: simply prompting models to abstain from uncertain predictions enables significant reliability gains, achieving near-perfect robustness in several settings. However, for domain-specific tasks such as galaxy morphology classification, a lack of specialized knowledge prevents reliable uncertainty estimation. Finally, we propose a novel mechanism based on caption diversity to reveal a model's internal uncertainty, enabling practitioners to predict when models will successfully abstain without relying on labeled data.
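The caption-diversity mechanism can be pictured as: sample several captions for the same image, then score how much they disagree. The abstract does not specify the diversity measure, so the sketch below stands in a simple pairwise token-overlap (Jaccard) dissimilarity; the function name and metric are illustrative assumptions, not the paper's method.

```python
from itertools import combinations

def caption_diversity_uncertainty(captions):
    """Estimate uncertainty as mean pairwise dissimilarity (1 - Jaccard
    token overlap) among captions sampled for one image.

    Assumption: Jaccard overlap is a stand-in; the paper's actual
    diversity measure is not specified in this abstract.
    Higher diversity -> higher estimated uncertainty.
    """
    token_sets = [set(c.lower().split()) for c in captions]
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        return 0.0

    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 1.0

    return sum(1.0 - jaccard(a, b) for a, b in pairs) / len(pairs)

# Consistent captions: the model describes the image the same way each time.
consistent = ["a golden retriever on grass"] * 5
# Divergent captions: the model's descriptions disagree, signaling uncertainty.
divergent = [
    "a golden retriever on grass",
    "a blurry brown shape outdoors",
    "two cats sleeping on a couch",
    "an abstract pattern of colors",
]
print(caption_diversity_uncertainty(consistent))  # 0.0 for identical captions
print(caption_diversity_uncertainty(divergent))   # strictly larger
```

Because the score needs only sampled captions, no labels are required, which matches the abstract's claim that practitioners can anticipate abstention behavior without labeled data.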