Benchmarking Corruption Robustness of LVLMs: A Discriminative Benchmark and Robustness Alignment Metric

📅 2025-11-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current LVLM robustness evaluation suffers from two critical limitations: (1) low-discriminative samples dominate benchmarks, obscuring meaningful model differences; and (2) accuracy-based metrics fail to capture structural degradation in predictions. To address these, we propose Bench-C—a novel benchmark constructed via joint filtering of samples exhibiting high prediction inconsistency and semantic diversity—and Robustness Alignment Score (RAS), a logit-level metric quantifying prediction structure stability. RAS decouples robustness into “disruption” and “correction” components and, for the first time, characterizes structural degradation through the lens of uncertainty estimation and calibration alignment. Experiments reveal that even minor perturbations—though occasionally boosting accuracy—consistently degrade prediction structure; models further exhibit distinct patterns of overconfidence and hesitation in erroneous predictions. Bench-C and RAS collectively enhance discriminability of robustness differences across LVLMs.
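The idea of a logit-level structure metric can be illustrated with a minimal sketch. The entropy-shift and symmetric-KL terms below are illustrative stand-ins for the uncertainty and calibration-alignment components, not the paper's actual RAS formula, and all function names are assumptions:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    # Shannon entropy of each predictive distribution
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def structure_shift(clean_logits, corrupt_logits):
    """Illustrative logit-level stability score (NOT the paper's RAS):
    combines the shift in predictive uncertainty (entropy) with a
    symmetric-KL divergence between clean and corrupted distributions."""
    p, q = softmax(clean_logits), softmax(corrupt_logits)
    uncertainty_shift = np.abs(entropy(q) - entropy(p))
    # symmetric KL as a simple distribution-shift term
    sym_kl = ((p - q) * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(uncertainty_shift.mean()), float(sym_kl.mean())
```

Both terms vanish when corruption leaves the logits unchanged and grow with structural degradation, even in cases where the argmax (and hence accuracy) is unaffected.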

📝 Abstract
Despite the remarkable reasoning abilities of large vision-language models (LVLMs), their robustness under visual corruptions remains insufficiently studied. Existing evaluation paradigms exhibit two major limitations: 1) the dominance of low-discriminative samples in current datasets masks the real robustness gap between models; and 2) conventional accuracy-based metrics fail to capture the degradation of the underlying prediction structure. To bridge these gaps, we introduce Bench-C, a comprehensive benchmark emphasizing discriminative samples for assessing corruption robustness, where a selection strategy is proposed to jointly consider prediction inconsistency under corruption and semantic diversity. Furthermore, we propose the Robustness Alignment Score (RAS), a unified metric that measures degradation in the logit-level prediction structure by considering shifts in prediction uncertainty and calibration alignment. Comprehensive experiments and analysis reveal several interesting findings: 1) model behaviors exhibit distinct patterns under corruptions, such as erroneous confidence and hesitation; 2) although subtle corruptions may lead to a slight accuracy gain, the overall prediction structure still degrades; 3) by decomposing corruption robustness into destructive and corrective components, distinct failure and recovery patterns across models can be revealed.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LVLM robustness to visual corruptions using discriminative samples
Measuring prediction structure degradation beyond accuracy metrics
Analyzing model failure patterns under corruption through decomposition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bench-C benchmark with discriminative sample selection
Robustness Alignment Score metric for logit degradation
Decomposition of corruption robustness into failure patterns
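The destructive/corrective decomposition can be sketched as a simple per-sample split. The function name and exact definitions below are assumptions for illustration, not the paper's formulation:

```python
import numpy as np

def decompose_robustness(clean_preds, corrupt_preds, labels):
    """Illustrative split of corruption effects: the destructive rate counts
    samples answered correctly on clean inputs but broken by corruption;
    the corrective rate counts samples that corruption accidentally fixes."""
    clean_ok = np.asarray(clean_preds) == np.asarray(labels)
    corrupt_ok = np.asarray(corrupt_preds) == np.asarray(labels)
    destructive = np.mean(clean_ok & ~corrupt_ok)  # failures induced by corruption
    corrective = np.mean(~clean_ok & corrupt_ok)   # recoveries under corruption
    return float(destructive), float(corrective)
```

Under this split, the net accuracy change equals the corrective rate minus the destructive rate, which is why a slight accuracy gain under corruption can coexist with substantial structural degradation.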
Xiangjie Sui
Faculty of Data Science, City University of Macau
image processing · visual quality assessment · computer vision
Songyang Li
City University of Macau
Hanwei Zhu
Nanyang Technological University
Baoliang Chen
South China Normal University
Yuming Fang
Jiangxi University of Finance and Economics
Image Processing · Video Processing · 3D Multimedia Processing
Xin Sun
City University of Macau