🤖 AI Summary
Medical imaging quality control (QC) suffers from subjectivity and low automation. Method: This study introduces the first standardized QC dataset for chest X-rays and CT reports, and proposes a multimodal human–AI collaborative closed-loop evaluation framework. The framework integrates adaptive data governance and dynamic feedback mechanisms, and evaluates large language models (Gemini 2.0-Flash, GPT-4o, DeepSeek-R1, and InternLM2.5-7B-chat) on recall, precision, and Macro F1. Results: Gemini 2.0-Flash achieves a Macro F1 of 90.0% on chest X-ray QC; DeepSeek-R1 attains 62.23% recall in CT report auditing; InternLM2.5-7B-chat yields the highest additional error detection rate. This work establishes the first trustworthy, iterative, large-model-driven medical imaging QC paradigm, substantially improving both the efficiency and the objectivity of clinical QC processes.
📝 Abstract
Medical imaging quality control (QC) is essential for accurate diagnosis, yet traditional QC methods remain labor-intensive and subjective. To address this challenge, we establish a standardized dataset and evaluation framework for medical imaging QC and systematically assess large language models (LLMs) on image quality assessment and report standardization. Specifically, we first constructed and anonymized a dataset of 161 chest X-ray (CXR) radiographs and 219 CT reports for evaluation. Multiple LLMs, including Gemini 2.0-Flash, GPT-4o, and DeepSeek-R1, were then evaluated on recall, precision, and F1 score for their ability to detect technical errors and inconsistencies. Experimental results show that Gemini 2.0-Flash achieved a Macro F1 score of 90 in CXR tasks, demonstrating strong generalization but limited fine-grained performance. DeepSeek-R1 excelled in CT report auditing with a 62.23% recall rate, outperforming the other models, although its distilled variants performed poorly. InternLM2.5-7B-chat exhibited the highest additional discovery rate, indicating broader but less precise error detection. These findings highlight the potential of LLMs in medical imaging QC, with DeepSeek-R1 and Gemini 2.0-Flash demonstrating superior performance.
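For readers unfamiliar with the reported metrics, the following is a minimal sketch of how per-class precision, recall, and Macro F1 are conventionally computed for a multi-class QC labeling task. It is not the authors' evaluation code: the label names (`ok`, `positioning_error`, `artifact`) and the toy predictions are hypothetical and chosen only for illustration.

```python
def per_class_prf(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} using one-vs-rest counts."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Guard against division by zero when a class is never predicted or never present.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so rare error classes count equally."""
    scores = per_class_prf(y_true, y_pred, labels)
    return sum(f1 for _, _, f1 in scores.values()) / len(labels)


# Hypothetical QC labels and toy data (not taken from the paper's dataset).
labels = ["ok", "positioning_error", "artifact"]
y_true = ["ok", "ok", "positioning_error", "artifact", "artifact", "ok"]
y_pred = ["ok", "positioning_error", "positioning_error", "artifact", "ok", "ok"]

print(per_class_prf(y_true, y_pred, labels))
print(round(macro_f1(y_true, y_pred, labels), 4))
```

Because Macro F1 averages the per-class scores without weighting by class frequency, a model can score well overall while still missing individual fine-grained error types, which is consistent with the abstract's remark about strong generalization but limited fine-grained performance.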