Benchmarking Corruption Robustness of LVLMs: A Discriminative Benchmark and Robustness Alignment Metric

📅 2025-11-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current LVLM robustness evaluation suffers from two critical limitations: (1) low-discriminative samples dominate benchmarks, obscuring meaningful model differences; and (2) accuracy-based metrics fail to capture structural degradation in predictions. To address these, we propose Bench-C—a novel benchmark constructed via joint filtering of samples exhibiting high prediction inconsistency and semantic diversity—and Robustness Alignment Score (RAS), a logit-level metric quantifying prediction structure stability. RAS decouples robustness into “disruption” and “correction” components and, for the first time, characterizes structural degradation through the lens of uncertainty estimation and calibration alignment. Experiments reveal that even minor perturbations—though occasionally boosting accuracy—consistently degrade prediction structure; models further exhibit distinct patterns of overconfidence and hesitation in erroneous predictions. Bench-C and RAS collectively enhance discriminability of robustness differences across LVLMs.
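The idea of a logit-level structure metric can be illustrated with a minimal sketch. The entropy-shift and symmetric-KL terms below are illustrative stand-ins for the uncertainty and calibration-alignment components, not the paper's actual RAS formula, and all function names are assumptions:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    # Shannon entropy of each predictive distribution
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def structure_shift(clean_logits, corrupt_logits):
    """Illustrative logit-level stability score (NOT the paper's RAS):
    combines the shift in predictive uncertainty (entropy) with a
    symmetric-KL divergence between clean and corrupted distributions."""
    p, q = softmax(clean_logits), softmax(corrupt_logits)
    uncertainty_shift = np.abs(entropy(q) - entropy(p))
    # symmetric KL as a simple distribution-shift term
    sym_kl = ((p - q) * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(uncertainty_shift.mean()), float(sym_kl.mean())
```

Both terms vanish when corruption leaves the logits unchanged and grow with structural degradation, even in cases where the argmax (and hence accuracy) is unaffected.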

📝 Abstract
Despite the remarkable reasoning abilities of large vision-language models (LVLMs), their robustness under visual corruptions remains insufficiently studied. Existing evaluation paradigms exhibit two major limitations: 1) the dominance of low-discriminative samples in current datasets masks the real robustness gap between models; and 2) conventional accuracy-based metrics fail to capture the degradation of the underlying prediction structure. To bridge these gaps, we introduce Bench-C, a comprehensive benchmark emphasizing discriminative samples for assessing corruption robustness, where a selection strategy is proposed to jointly consider prediction inconsistency under corruption and semantic diversity. Furthermore, we propose the Robustness Alignment Score (RAS), a unified metric that measures degradation in the logit-level prediction structure by considering shifts in prediction uncertainty and calibration alignment. Comprehensive experiments and analysis reveal several interesting findings: 1) model behaviors exhibit distinct patterns under corruptions, such as erroneous confidence and hesitation; 2) although subtle corruptions may lead to a slight accuracy gain, the overall prediction structure still degrades; 3) by decomposing corruption robustness into destructive and corrective components, distinct failure and recovery patterns across models can be revealed.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LVLM robustness to visual corruptions using discriminative samples
Measuring prediction structure degradation beyond accuracy metrics
Analyzing model failure patterns under corruption through decomposition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bench-C benchmark with discriminative sample selection
Robustness Alignment Score metric for logit degradation
Decomposition of corruption robustness into failure patterns
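The destructive/corrective decomposition can be sketched as a simple per-sample split. The function name and exact definitions below are assumptions for illustration, not the paper's formulation:

```python
import numpy as np

def decompose_robustness(clean_preds, corrupt_preds, labels):
    """Illustrative split of corruption effects: the destructive rate counts
    samples answered correctly on clean inputs but broken by corruption;
    the corrective rate counts samples that corruption accidentally fixes."""
    clean_ok = np.asarray(clean_preds) == np.asarray(labels)
    corrupt_ok = np.asarray(corrupt_preds) == np.asarray(labels)
    destructive = np.mean(clean_ok & ~corrupt_ok)  # failures induced by corruption
    corrective = np.mean(~clean_ok & corrupt_ok)   # recoveries under corruption
    return float(destructive), float(corrective)
```

Under this split, the net accuracy change equals the corrective rate minus the destructive rate, which is why a slight accuracy gain under corruption can coexist with substantial structural degradation.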
Xiangjie Sui
Faculty of Data Science, City University of Macau
image processing · visual quality assessment · computer vision
Songyang Li
City University of Macau
Hanwei Zhu
Nanyang Technological University
Baoliang Chen
South China Normal University
Yuming Fang
Jiangxi University of Finance and Economics
Image Processing · Video Processing · 3D Multimedia Processing
Xin Sun
City University of Macau