MedQ-Deg: A Multidimensional Benchmark for Evaluating MLLMs Across Medical Image Quality Degradations

πŸ“… 2026-03-08
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This study addresses the insufficient robustness of current medical multimodal large language models (MLLMs) under image quality degradation and the lack of systematic evaluation and confidence calibration analysis. To this end, the authors introduce MedQ-Deg, a comprehensive benchmark comprising 24,894 question-answer pairs across seven imaging modalities, 18 expert-validated degradation types (each implemented at three severity levels calibrated by radiologists), and 30 capability dimensions. The work presents the first multidimensional evaluation framework for medical image degradation and proposes a novel Calibration Shift metric that quantifies confidence-accuracy misalignment under degraded inputs, revealing a prevalent "AI Dunning-Kruger effect" in which models remain highly confident despite low accuracy. Evaluation of 40 state-of-the-art MLLMs demonstrates systematic performance decline with increasing degradation severity and significant variation across modalities, capability dimensions, and degradation types.
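The paper's exact formula is not reproduced in this summary, but a minimal sketch of the idea behind Calibration Shift, assuming it compares the confidence-accuracy gap on degraded images against the same gap on clean images, might look like the following (function and variable names are illustrative, not taken from the paper):

```python
import numpy as np

def calibration_gap(confidences, correct):
    """Mean gap between self-reported confidence (0-1) and accuracy (0/1)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return confidences.mean() - correct.mean()

def calibration_shift(conf_clean, correct_clean, conf_deg, correct_deg):
    """Illustrative Calibration Shift: how much the confidence-accuracy gap
    widens when moving from clean to degraded images. A large positive value
    means the model stays confident while its accuracy collapses, i.e. the
    "AI Dunning-Kruger effect" the paper reports."""
    return (calibration_gap(conf_deg, correct_deg)
            - calibration_gap(conf_clean, correct_clean))

# Toy example: accuracy drops from 0.8 to 0.4 while confidence barely moves.
conf_clean, correct_clean = [0.9, 0.8, 0.85, 0.9, 0.8], [1, 1, 1, 1, 0]
conf_deg,   correct_deg   = [0.85, 0.8, 0.9, 0.8, 0.85], [1, 0, 0, 1, 0]
print(calibration_shift(conf_clean, correct_clean, conf_deg, correct_deg))  # ~0.39
```

In this hypothetical form, a shift near zero would indicate that confidence tracks the performance drop, while a large positive shift flags the overconfidence pattern described above.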

πŸ“ Abstract
Despite impressive performance on standard benchmarks, multimodal large language models (MLLMs) face critical challenges in real-world clinical environments where medical images inevitably suffer various quality degradations. Existing benchmarks exhibit two key limitations: (1) absence of large-scale, multidimensional assessment across medical image quality gradients and (2) no systematic confidence calibration analysis. To address these gaps, we present MedQ-Deg, a comprehensive benchmark for evaluating medical MLLMs under image quality degradations. MedQ-Deg provides multi-dimensional evaluation spanning 18 distinct degradation types, 30 fine-grained capability dimensions, and 7 imaging modalities, with 24,894 question-answer pairs. Each degradation is implemented at 3 severity degrees, calibrated by expert radiologists. We further introduce the Calibration Shift metric, which quantifies the gap between a model's perceived confidence and actual performance to assess metacognitive reliability under degradation. Our comprehensive evaluation of 40 mainstream MLLMs reveals several critical findings: (1) overall model performance degrades systematically as degradation severity increases, (2) models universally exhibit the AI Dunning-Kruger Effect, maintaining inappropriately high confidence despite severe accuracy collapse, and (3) models display markedly differentiated behavioral patterns across capability dimensions, imaging modalities, and degradation types. We hope MedQ-Deg drives progress toward medical MLLMs that are robust and trustworthy in real clinical practice.
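As a rough illustration of the kind of quality gradient the abstract describes (the benchmark's 18 degradation types and radiologist-calibrated settings are not reproduced here), the sketch below applies two common degradations at three increasing severity levels; the parameter values are assumptions chosen only for demonstration:

```python
import numpy as np
from PIL import Image, ImageFilter

# Hypothetical severity settings; MedQ-Deg's actual levels are expert-calibrated.
SEVERITY = {
    1: {"blur_radius": 1.0, "noise_sigma": 5.0},
    2: {"blur_radius": 2.5, "noise_sigma": 15.0},
    3: {"blur_radius": 5.0, "noise_sigma": 30.0},
}

def degrade(image: Image.Image, kind: str, level: int) -> Image.Image:
    """Apply one illustrative degradation ('blur' or 'noise') at a severity level."""
    params = SEVERITY[level]
    if kind == "blur":
        return image.filter(ImageFilter.GaussianBlur(params["blur_radius"]))
    if kind == "noise":
        arr = np.asarray(image, dtype=np.float32)
        arr += np.random.normal(0.0, params["noise_sigma"], arr.shape)
        return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    raise ValueError(f"unknown degradation kind: {kind}")

# Usage: build three increasingly degraded variants of one scan for evaluation.
# scan = Image.open("chest_xray.png").convert("L")
# variants = {lvl: degrade(scan, "blur", lvl) for lvl in (1, 2, 3)}
```

Evaluating a model's accuracy and self-reported confidence on each severity level of such variants is the setup under which the degradation-robustness and calibration findings above would be measured.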
Problem

Research questions and friction points this paper is trying to address.

medical image quality degradation
multimodal large language models
confidence calibration
evaluation benchmark
clinical robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

medical MLLMs
image quality degradation
confidence calibration
multidimensional benchmark
Calibration Shift