Is ChatGPT-5 Ready for Mammogram VQA?

📅 2025-08-15
🤖 AI Summary
General-purpose multimodal large language models (MLLMs) face domain adaptation challenges in medical visual question answering (VQA), particularly for breast cancer screening tasks. Method: This study systematically evaluates the GPT-5 family against GPT-4o on three critical mammographic VQA tasks—BI-RADS assessment, abnormality detection, and malignancy classification—under a zero-shot, cross-dataset setting across four public breast imaging datasets. Contribution/Results: GPT-5 achieves a significant performance leap over GPT-4o across most metrics; however, its sensitivity (63.5%) and specificity (52.3%) remain substantially below radiologist-level performance. This work provides the first empirical benchmark of state-of-the-art closed-source MLLMs for clinical breast cancer screening assistance, revealing fundamental limitations in domain-specific reasoning and calibration. It establishes foundational insights into the clinical adaptability gap of generalist vision-language models and offers methodological guidance for future medical VQA development.
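The zero-shot, cross-dataset setting described above amounts to sending each mammogram with a task question as a single multimodal chat request. A minimal sketch of building such a request follows; the model name, question wording, and helper function are illustrative assumptions, not taken from the paper:

```python
import base64

def build_vqa_request(image_bytes, question, model="gpt-5"):
    """Build a zero-shot mammogram-VQA request in the OpenAI multimodal
    chat message format. Model name and prompt text are assumptions."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                # The task question, e.g. BI-RADS assessment or
                # malignancy classification, phrased as plain text.
                {"type": "text", "text": question},
                # The mammogram, inlined as a base64 data URL.
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

# Hypothetical usage: one request per image, no exemplars (zero-shot).
req = build_vqa_request(
    b"\x89PNG...",  # raw PNG bytes of a mammogram (placeholder here)
    "Assign a BI-RADS category (0-6) to this mammogram. "
    "Answer with the number only.",
)
```

The dict produced here matches the payload shape accepted by OpenAI's chat completions endpoint for image inputs; the actual prompts and any answer-parsing logic used in the study are not specified in this summary.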

📝 Abstract
Mammogram visual question answering (VQA) integrates image interpretation with clinical reasoning and has the potential to support breast cancer screening. We systematically evaluated the GPT-5 family and GPT-4o on four public mammography datasets (EMBED, INbreast, CMMD, CBIS-DDSM) for BI-RADS assessment, abnormality detection, and malignancy classification tasks. GPT-5 was consistently the best-performing model but lagged behind both human experts and domain-specific fine-tuned models. On EMBED, GPT-5 achieved the highest scores among GPT variants in density (56.8%), distortion (52.5%), mass (64.5%), calcification (63.5%), and malignancy (52.8%) classification. On INbreast, it attained 36.9% BI-RADS accuracy, 45.9% abnormality detection, and 35.0% malignancy classification. On CMMD, GPT-5 reached 32.3% abnormality detection and 55.0% malignancy accuracy. On CBIS-DDSM, it achieved 69.3% BI-RADS accuracy, 66.0% abnormality detection, and 58.2% malignancy accuracy. Compared with human expert estimates, GPT-5 exhibited lower sensitivity (63.5%) and specificity (52.3%). While GPT-5 shows promising capabilities for screening tasks, its performance remains insufficient for high-stakes clinical imaging applications without targeted domain adaptation and optimization. However, the substantial performance improvement from GPT-4o to GPT-5 suggests a promising trend for general large language models (LLMs) in assisting with mammography VQA tasks.
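The sensitivity and specificity figures quoted above come from standard confusion-matrix definitions. As a reminder of how such numbers are computed from binary malignancy predictions, here is a minimal sketch (the function and labels are illustrative, not the authors' evaluation code):

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute (sensitivity, specificity) from binary labels,
    where 1 = malignant and 0 = benign."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Sensitivity (recall on malignant cases): TP / (TP + FN)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity (recall on benign cases): TN / (TN + FP)
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec
```

A sensitivity of 63.5% therefore means roughly one in three malignant cases would be missed, which is why the abstract concludes the model is not yet suitable for high-stakes screening use.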
Problem

Research questions and friction points this paper is trying to address.

Evaluating GPT-5 for mammogram visual question answering tasks
Comparing GPT-5 performance with human experts and fine-tuned models
Assessing GPT-5's potential in clinical breast cancer screening
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated GPT-5 on mammography datasets
Compared performance with human experts
Highlighted need for domain adaptation
Authors
Qiang Li
Department of Radiation Oncology, Winship Cancer Institute, Emory University School of Medicine
Shansong Wang
Postdoctoral Research Fellow at Emory University (computer vision, multimodal learning, foundation model)
Mingzhe Hu
Department of Radiation Oncology, Winship Cancer Institute, Emory University School of Medicine
Mojtaba Safari
Postdoctoral Fellow, Emory University (medical physics, MRI, medical image analysis)
Zachary Eidex
Department of Radiation Oncology, Winship Cancer Institute, Emory University School of Medicine
Xiaofeng Yang
Department of Radiation Oncology, Winship Cancer Institute, Emory University School of Medicine