MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs

📅 2025-10-02
🤖 AI Summary
Existing medical image quality assessment (IQA) methods rely on scalar scores and fail to model radiologists’ human-like, natural-language-based reasoning. To address this, we propose MedQ-Bench—the first MLLM-oriented IQA benchmark for medical imaging—featuring a dual-task evaluation framework encompassing perception (e.g., artifact detection) and reasoning (e.g., cause analysis, clinical impact assessment). It covers five imaging modalities and 40+ quality attributes, integrating physics-based simulations and AI-generated data to enhance generalizability. We introduce a multi-axis evaluation protocol and, for the first time, empirically validate consistency between MLLM assessments and clinical expert judgments. Comprehensive evaluation across 14 state-of-the-art MLLMs reveals that current models exhibit preliminary perceptual and reasoning capabilities but suffer from limited stability and clinical reliability. MedQ-Bench is publicly released to advance language-driven, interpretable medical image quality control research.

📝 Abstract
Medical Image Quality Assessment (IQA) serves as the first-mile safety gate for clinical AI, yet existing approaches remain constrained by scalar, score-based metrics and fail to reflect the descriptive, human-like reasoning process central to expert evaluation. To address this gap, we introduce MedQ-Bench, a comprehensive benchmark that establishes a perception-reasoning paradigm for language-based evaluation of medical image quality with Multi-modal Large Language Models (MLLMs). MedQ-Bench defines two complementary tasks: (1) MedQ-Perception, which probes low-level perceptual capability via human-curated questions on fundamental visual attributes; and (2) MedQ-Reasoning, encompassing both no-reference and comparison reasoning tasks, aligning model evaluation with human-like reasoning on image quality. The benchmark spans five imaging modalities and over forty quality attributes, totaling 2,600 perceptual queries and 708 reasoning assessments, covering diverse image sources including authentic clinical acquisitions, images with simulated degradations via physics-based reconstructions, and AI-generated images. To evaluate reasoning ability, we propose a multi-dimensional judging protocol that assesses model outputs along four complementary axes. We further conduct rigorous human-AI alignment validation by comparing LLM-based judgement with radiologists. Our evaluation of 14 state-of-the-art MLLMs demonstrates that models exhibit preliminary but unstable perceptual and reasoning skills, with insufficient accuracy for reliable clinical use. These findings highlight the need for targeted optimization of MLLMs in medical IQA. We hope that MedQ-Bench will catalyze further exploration and unlock the untapped potential of MLLMs for medical image quality evaluation.
Problem

Research questions and friction points this paper is trying to address.

Evaluating the medical image quality assessment abilities of MLLMs
Establishing a perception-reasoning paradigm for clinical safety
Moving beyond scalar metrics with descriptive, language-based evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MedQ-Bench, a benchmark for medical image quality assessment
Proposes a perception-reasoning paradigm using Multi-modal Large Language Models
Includes a multi-dimensional judging protocol and human-AI alignment validation
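The human-AI alignment validation compares LLM-based judgements with radiologists. A standard way to quantify such agreement on categorical quality grades is Cohen's kappa, which corrects raw agreement for chance. Below is a minimal, self-contained sketch; the grade data are hypothetical and the statistic is a common choice, not necessarily the exact protocol used in the paper:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on categorical labels."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from the raters' marginal label frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 5-point quality grades from an LLM judge and a radiologist.
llm_judge   = [5, 4, 4, 2, 1, 3, 5, 2]
radiologist = [5, 4, 3, 2, 1, 3, 4, 2]
print(round(cohens_kappa(llm_judge, radiologist), 3))  # → 0.686
```

Values near 1 indicate near-perfect alignment; values near 0 indicate agreement no better than chance, which is the kind of evidence the benchmark's validation step would surface.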
👥 Authors

Jiyao Liu, Fudan University
Jinjie Wei, Fudan University (Large Language Model)
Wanying Qu, Fudan University
Chenglong Ma, Fudan University; Shanghai Innovation Institute (multi-modal models, generative models, medical image analysis)
Junzhi Ning, Shanghai Artificial Intelligence Laboratory
Yunheng Li, Nankai University (Computer Vision)
Ying Chen, Shanghai Artificial Intelligence Laboratory
Xinzhe Luo, Imperial College London (machine learning, medical imaging, statistics, inverse problems)
Pengcheng Chen, Shanghai Artificial Intelligence Laboratory
Xin Gao, Fudan University
Ming Hu, Shanghai Artificial Intelligence Laboratory
Huihui Xu, Shanghai Artificial Intelligence Laboratory
Xin Wang, Shanghai Artificial Intelligence Laboratory
Shujian Gao, Fudan University
Dingkang Yang, ByteDance (Multimodal Learning, Generative AI, Embodied AI)
Zhongying Deng, University of Cambridge (Deep Learning, Multi-modal Learning, Computer Vision, Medical Image Analysis)
Jin Ye, Shanghai Artificial Intelligence Laboratory
Lihao Liu, Amazon (LLM-based Agent, Healthcare AI)
Junjun He, Shanghai Jiao Tong University
Ningsheng Xu, Fudan University