Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations?

📅 2025-06-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work identifies a vulnerability in large vision-language models (LVLMs): although they are pretrained on extensive, diverse data without explicit task supervision, they still pick up spurious correlations, relying on non-essential visual features as shortcuts and losing robustness on real-world visual question answering (VQA). To study this, the authors introduce SpuriVerse, a benchmark built from GPT-4o errors on real-world VQA benchmarks and curated through LVLM-human annotation and synthetic counterfactual evaluation. It covers 124 distinct types of spurious correlations, each with 1 realistic and 10 synthetic samples, for 1,364 multiple-choice questions in total. Across the 15 open and closed-source LVLMs evaluated, the best accuracy is only 37.1%, even among state-of-the-art closed-source models. Fine-tuning on synthetic examples that emphasize the spurious correlation raises performance to 78.40%, suggesting that training on diverse spurious patterns generalizes to unseen situations and that models can learn to avoid such shortcuts.

📝 Abstract

Fine-tuning can cause spurious correlations to arise between non-essential features and the target labels, but benchmarks to study these effects involve contrived settings and narrow tasks. In contrast, we consider spurious correlations in multi-modal Large Vision Language Models (LVLMs) pretrained on extensive and diverse datasets without explicit task supervision. We develop a benchmark by sourcing GPT-4o errors on real-world visual-question-answering (VQA) benchmarks, then curating a subset through LVLM-human annotation and synthetic counterfactual evaluation to identify errors caused by spurious correlations. This process yields SpuriVerse, a novel benchmark comprising 124 distinct types of spurious correlations extracted from real-world datasets, each containing 1 realistic and 10 synthetic VQA samples, for a total of 1,364 multiple-choice questions. We evaluate 15 open and closed-source LVLMs on SpuriVerse, finding that even state-of-the-art closed-source models struggle significantly, achieving at best only 37.1% accuracy. Fine-tuning on synthetic examples that emphasize the spurious correlation improves performance to 78.40%, suggesting that training on diverse spurious patterns generalizes to unseen situations: models appear to learn to avoid "shortcuts" and attend to the overall image context.
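The benchmark size quoted in the abstract follows from simple arithmetic: 124 types × (1 realistic + 10 synthetic) samples = 1,364 questions. The sketch below checks that count and shows one plausible way to score multiple-choice predictions overall and per spurious-correlation type. The record layout `(group_id, predicted, correct)` and all function names are assumptions made for illustration; they are not taken from the paper or its released code.

```python
from collections import defaultdict

# Figures stated in the abstract.
NUM_GROUPS = 124           # distinct types of spurious correlations
REAL_PER_GROUP = 1         # realistic VQA sample per type
SYNTHETIC_PER_GROUP = 10   # synthetic VQA samples per type


def benchmark_size(groups=NUM_GROUPS, real=REAL_PER_GROUP,
                   synthetic=SYNTHETIC_PER_GROUP):
    """Total number of multiple-choice questions in the benchmark."""
    return groups * (real + synthetic)


def overall_accuracy(records):
    """Fraction of records whose predicted choice matches the correct one.

    `records` is a hypothetical list of (group_id, predicted, correct) tuples.
    """
    records = list(records)
    if not records:
        return 0.0
    return sum(pred == gold for _, pred, gold in records) / len(records)


def accuracy_by_group(records):
    """Accuracy computed separately for each spurious-correlation type."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group_id, pred, gold in records:
        totals[group_id] += 1
        hits[group_id] += int(pred == gold)
    return {group: hits[group] / totals[group] for group in totals}


if __name__ == "__main__":
    print(benchmark_size())
    demo = [("g1", "A", "A"), ("g1", "B", "A"), ("g2", "C", "C")]
    print(overall_accuracy(demo))
    print(accuracy_by_group(demo))
```

Reporting accuracy per type as well as overall is a design choice of this sketch: a model could score well on average while still failing consistently on particular spurious patterns.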

Problem

Research questions and friction points this paper is trying to address.

Studying spurious correlations in multi-modal Large Vision-Language Models

Developing a benchmark to identify real-world spurious correlation errors

Evaluating LVLM performance and improving generalization via synthetic training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark built from GPT-4o errors on real-world VQA benchmarks

Synthetic counterfactual evaluation identifies spurious correlations

Fine-tuning on diverse spurious patterns improves generalization

🔎 Similar Papers

From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models