Is ChatGPT-5 Ready for Mammogram VQA?

📅 2025-08-15

🤖 AI Summary

General-purpose multimodal large language models (MLLMs) face domain adaptation challenges in medical visual question answering (VQA), particularly for breast cancer screening tasks. Method: This study systematically evaluates the GPT-5 family against GPT-4o on three mammographic VQA tasks (BI-RADS assessment, abnormality detection, and malignancy classification) under a zero-shot, cross-dataset setting across four public mammography datasets (EMBED, InBreast, CMMD, CBIS-DDSM). Contribution/Results: GPT-5 improves markedly on GPT-4o across most metrics; however, its sensitivity (63.5%) and specificity (52.3%) remain below human expert estimates, and it also lags behind domain-specific fine-tuned models. The work provides an empirical benchmark of state-of-the-art closed-source MLLMs for breast cancer screening assistance, pointing to limitations in domain-specific reasoning and calibration. It characterizes the clinical adaptability gap of generalist vision-language models and offers methodological guidance for future medical VQA development.

📝 Abstract

Mammogram visual question answering (VQA) integrates image interpretation with clinical reasoning and has the potential to support breast cancer screening. We systematically evaluated the GPT-5 family and GPT-4o on four public mammography datasets (EMBED, InBreast, CMMD, CBIS-DDSM) for BI-RADS assessment, abnormality detection, and malignancy classification. GPT-5 was consistently the best-performing model but lagged behind both human experts and domain-specific fine-tuned models. On EMBED, GPT-5 achieved the highest scores among GPT variants in density (56.8%), distortion (52.5%), mass (64.5%), calcification (63.5%), and malignancy (52.8%) classification. On InBreast, it attained 36.9% BI-RADS accuracy, 45.9% abnormality detection, and 35.0% malignancy classification. On CMMD, GPT-5 reached 32.3% abnormality detection and 55.0% malignancy accuracy. On CBIS-DDSM, it achieved 69.3% BI-RADS accuracy, 66.0% abnormality detection, and 58.2% malignancy accuracy. Relative to estimates of human expert performance, GPT-5 showed lower sensitivity (63.5%) and specificity (52.3%). While GPT-5 exhibits promising capabilities for screening tasks, its performance remains insufficient for high-stakes clinical imaging applications without targeted domain adaptation and optimization. Nevertheless, the large improvement from GPT-4o to GPT-5 suggests a promising trend in the potential of general-purpose large language models (LLMs) to assist with mammography VQA tasks.
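The sensitivity and specificity figures quoted above are standard confusion-matrix ratios: sensitivity is the share of truly malignant cases the model flags, and specificity is the share of truly benign cases it clears. The sketch below shows that arithmetic in Python. It is only an illustration: the function name `sensitivity_specificity` and the toy labels are assumptions made for this example, not the paper's evaluation code or data.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels (1 = malignant, 0 = benign)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four confusion-matrix cells.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")

    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity


# Toy labels, invented purely for illustration (not taken from any dataset).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

A specificity near 50% on a balanced benign set means roughly half of benign cases are flagged, which is why the abstract treats the reported figures as insufficient for screening use despite the gain over GPT-4o.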

Problem

Research questions and friction points this paper is trying to address.

Evaluating GPT-5 for mammogram visual question answering tasks

Comparing GPT-5 performance with human experts and fine-tuned models

Assessing GPT-5's potential in clinical breast cancer screening

Innovation

Methods, ideas, or system contributions that make the work stand out.

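Evaluated the GPT-5 family and GPT-4o on four public mammography datasets (EMBED, InBreast, CMMD, CBIS-DDSM)

Compared performance with human experts and domain-specific fine-tuned models

Highlighted the need for targeted domain adaptation before high-stakes clinical use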
