MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs

📅 2025-10-02
🤖 AI Summary
Existing medical image quality assessment (IQA) methods rely on scalar scores and fail to model radiologists’ human-like, natural-language-based reasoning. To address this, we propose MedQ-Bench—the first MLLM-oriented IQA benchmark for medical imaging—featuring a dual-task evaluation framework encompassing perception (e.g., artifact detection) and reasoning (e.g., cause analysis, clinical impact assessment). It covers five imaging modalities and 40+ quality attributes, integrating physics-based simulations and AI-generated data to enhance generalizability. We introduce a multi-axis evaluation protocol and, for the first time, empirically validate consistency between MLLM assessments and clinical expert judgments. Comprehensive evaluation across 14 state-of-the-art MLLMs reveals that current models exhibit preliminary perceptual and reasoning capabilities but suffer from limited stability and clinical reliability. MedQ-Bench is publicly released to advance language-driven, interpretable medical image quality control research.

📝 Abstract
Medical Image Quality Assessment (IQA) serves as the first-mile safety gate for clinical AI, yet existing approaches remain constrained by scalar, score-based metrics and fail to reflect the descriptive, human-like reasoning process central to expert evaluation. To address this gap, we introduce MedQ-Bench, a comprehensive benchmark that establishes a perception-reasoning paradigm for language-based evaluation of medical image quality with Multi-modal Large Language Models (MLLMs). MedQ-Bench defines two complementary tasks: (1) MedQ-Perception, which probes low-level perceptual capability via human-curated questions on fundamental visual attributes; and (2) MedQ-Reasoning, encompassing both no-reference and comparison reasoning tasks, aligning model evaluation with human-like reasoning on image quality. The benchmark spans five imaging modalities and over forty quality attributes, totaling 2,600 perceptual queries and 708 reasoning assessments, covering diverse image sources including authentic clinical acquisitions, images with simulated degradations via physics-based reconstructions, and AI-generated images. To evaluate reasoning ability, we propose a multi-dimensional judging protocol that assesses model outputs along four complementary axes. We further conduct rigorous human-AI alignment validation by comparing LLM-based judgement with radiologists. Our evaluation of 14 state-of-the-art MLLMs demonstrates that models exhibit preliminary but unstable perceptual and reasoning skills, with insufficient accuracy for reliable clinical use. These findings highlight the need for targeted optimization of MLLMs in medical IQA. We hope that MedQ-Bench will catalyze further exploration and unlock the untapped potential of MLLMs for medical image quality evaluation.
Problem

Research questions and friction points this paper is trying to address.

Evaluating the medical image quality assessment abilities of MLLMs
Establishing a perception-reasoning paradigm for clinical safety
Moving beyond scalar metrics with descriptive, language-based evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MedQ-Bench, a benchmark for medical image quality assessment
Proposes a perception-reasoning paradigm using Multi-modal Large Language Models
Includes a multi-dimensional judging protocol and human-AI alignment validation
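The human-AI alignment validation compares LLM-based judgements with radiologists. A standard way to quantify such agreement on categorical quality grades is Cohen's kappa, which corrects raw agreement for chance. Below is a minimal, self-contained sketch; the grade data are hypothetical and the statistic is a common choice, not necessarily the exact protocol used in the paper:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on categorical labels."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from the raters' marginal label frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 5-point quality grades from an LLM judge and a radiologist.
llm_judge   = [5, 4, 4, 2, 1, 3, 5, 2]
radiologist = [5, 4, 3, 2, 1, 3, 4, 2]
print(round(cohens_kappa(llm_judge, radiologist), 3))  # → 0.686
```

Values near 1 indicate near-perfect alignment; values near 0 indicate agreement no better than chance, which is the kind of evidence the benchmark's validation step would surface.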
👥 Authors

Jiyao Liu, Fudan University
Jinjie Wei, Fudan University (Large Language Model)
Wanying Qu, Fudan University
Chenglong Ma, Fudan University; Shanghai Innovation Institute (multi-modal models, generative models, medical image analysis)
Junzhi Ning, Shanghai Artificial Intelligence Laboratory
Yunheng Li, Nankai University (Computer Vision)
Ying Chen, Shanghai Artificial Intelligence Laboratory
Xinzhe Luo, Imperial College London (machine learning, medical imaging, statistics, inverse problems)
Pengcheng Chen, Shanghai Artificial Intelligence Laboratory
Xin Gao, Fudan University
Ming Hu, Shanghai Artificial Intelligence Laboratory
Huihui Xu, Shanghai Artificial Intelligence Laboratory
Xin Wang, Shanghai Artificial Intelligence Laboratory
Shujian Gao, Fudan University
Dingkang Yang, ByteDance (Multimodal Learning, Generative AI, Embodied AI)
Zhongying Deng, University of Cambridge (Deep Learning, Multi-modal Learning, Computer Vision, Medical Image Analysis)
Jin Ye, Shanghai Artificial Intelligence Laboratory
Lihao Liu, Amazon (LLM-based Agent, Healthcare AI)
Junjun He, Shanghai Jiao Tong University
Ningsheng Xu, Fudan University