Metrics that matter: Evaluating image quality metrics for medical image generation

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current no-reference image quality metrics (NR-IQMs) lack clinical reliability for evaluating medical image generation models when ground-truth references are unavailable. Method: We systematically assess BRISQUE, NIQE, and CLIP-IQA on brain MRI generation, including tumour and vascular imaging, via controlled perturbation experiments, cross-architecture generative model analysis, and correlation modelling against downstream segmentation Dice scores. Contribution/Results: Mainstream NR-IQMs correlate weakly with clinical validity (r < 0.3), overlook anatomically relevant local structures and pathological details, and can even reward distributional artefacts such as data memorisation. Crucially, high NR-IQM scores often mask fundamental model deficiencies, posing tangible risks for clinical deployment. To address this, we propose a multidimensional clinical-adaptability evaluation framework that integrates downstream task validation, establishing a safer, more clinically grounded paradigm for assessing medical generative models.

📝 Abstract
Evaluating generative models for synthetic medical imaging is crucial yet challenging, especially given the high standards of fidelity, anatomical accuracy, and safety required for clinical applications. Standard evaluation of generated images often relies on no-reference image quality metrics when ground truth images are unavailable, but their reliability in this complex domain is not well established. This study comprehensively assesses commonly used no-reference image quality metrics using brain MRI data, including tumour and vascular images, providing a representative exemplar for the field. We systematically evaluate metric sensitivity to a range of challenges, including noise, distribution shifts, and, critically, localised morphological alterations designed to mimic clinically relevant inaccuracies. We then compare these metric scores against model performance on a relevant downstream segmentation task, analysing results across both controlled image perturbations and outputs from different generative model architectures. Our findings reveal significant limitations: many widely-used no-reference image quality metrics correlate poorly with downstream task suitability and exhibit a profound insensitivity to localised anatomical details crucial for clinical validity. Furthermore, these metrics can yield misleading scores regarding distribution shifts, e.g. data memorisation. This reveals the risk of misjudging model readiness, potentially leading to the deployment of flawed tools that could compromise patient safety. We conclude that ensuring generative models are truly fit for clinical purpose requires a multifaceted validation framework, integrating performance on relevant downstream tasks with the cautious interpretation of carefully selected no-reference image quality metrics.
Problem

Research questions and friction points this paper is trying to address.

Evaluating the reliability of no-reference image quality metrics for medical imaging
Assessing metric sensitivity to noise, distribution shifts, and anatomical inaccuracies
Identifying where current metrics fail on clinical validity and downstream tasks
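The insensitivity to localised anatomical errors can be illustrated with a toy experiment (a pure-Python sketch on synthetic data, not the paper's actual pipeline): a global image statistic of the kind many quality scores aggregate barely moves when a small but clinically critical patch is erased.

```python
import random

random.seed(0)
H = W = 64
# Synthetic "slice": uniform tissue intensity plus mild noise (stand-in for MRI).
img = [[0.5 + 0.1 * random.random() for _ in range(W)] for _ in range(H)]

def mean_intensity(im):
    """Global mean intensity, a crude stand-in for an aggregated quality score."""
    return sum(v for row in im for v in row) / (len(im) * len(im[0]))

# Localised morphological alteration: erase a 4x4 patch (16 of 4096 pixels),
# mimicking a spatially small but clinically critical anatomical error.
perturbed = [row[:] for row in img]
for i in range(30, 34):
    for j in range(30, 34):
        perturbed[i][j] = 0.0

shift = abs(mean_intensity(perturbed) - mean_intensity(img))
print(f"relative change in global mean: {shift / mean_intensity(img):.2%}")
# The global statistic moves by well under 1% despite the structural defect.
```

A real NR-IQM aggregates richer statistics than a mean, but the failure mode is the same: any score pooled over the whole image dilutes a defect that occupies a fraction of a percent of the pixels.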
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic evaluation of no-reference image quality metrics for medical image generation
Controlled perturbation experiments probing sensitivity to clinically relevant inaccuracies
Correlation of metric scores with downstream segmentation performance
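The comparison of metric scores against downstream task performance reduces to a correlation analysis over per-image pairs. A minimal pure-Python sketch (the NR-IQM and Dice values below are invented for illustration, not taken from the paper):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-image NR-IQM scores (higher = "better" to the metric) and
# Dice scores from a downstream segmentation model run on the same images.
nr_iqm = [0.91, 0.88, 0.93, 0.85, 0.90, 0.87]
dice   = [0.62, 0.74, 0.61, 0.70, 0.73, 0.64]

r = pearson_r(nr_iqm, dice)
print(f"Pearson r = {r:.2f}")
# An r far below +1 (here even negative) flags the metric as an unreliable
# proxy for downstream clinical utility.
```

Rank correlations (e.g. Spearman) are a common variant of the same analysis when only the ordering of models matters rather than the score magnitudes.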
👥 Authors
Yash Deo
Department of Computer Science, University of York, York, UK

Yan Jia
Department of Computer Science, University of York, York, UK

Toni Lassila
Lecturer at University of Leeds

William A. P. Smith
Professor, Department of Computer Science, University of York
Computer Vision, Computer Graphics, Machine Learning

Tom Lawton
Improvement Academy, Bradford Institute for Health Research
Critical care, routine data, simulation modelling, artificial intelligence, safety

Siyuan Kang
PhD student, Department of Geography, National University of Singapore
Economic Geography, Health Geography, Geovisualisation, Supply Chain, Global Trade

Alejandro F. Frangi
Department of Computer Science, School of Engineering, University of Manchester, Manchester, UK; School of Health Sciences, Division of Informatics, Imaging and Data Sciences, University of Manchester, Manchester, UK; Department of Cardiovascular Sciences, KU Leuven, Leuven, Belgium; Department of Electrical Engineering (ESAT), KU Leuven, Leuven, Belgium

Ibrahim Habli
Professor of Safety-Critical Systems at the University of York
Safety, AI Safety, Autonomous Systems, Software Engineering