🤖 AI Summary
This study addresses the clinically critical problem of precise brain tumor classification from MRI scans. We conduct the first systematic evaluation of GPT-5 series models on a zero-shot visual question answering (VQA) task that combines multi-sequence, three-plane MRI mosaic images with structured clinical features. Methodologically, we introduce the first cross-modal VQA benchmark specifically designed for neuro-oncology and propose a novel zero-shot chain-of-thought prompting strategy. Experimental results show that GPT-5-mini achieves the highest macro-average accuracy of 44.19% (GPT-5: 43.71%), demonstrating preliminary cross-modal medical reasoning capability in large vision-language models. However, performance remains substantially below clinical deployment requirements. This work provides empirical evidence for assessing the capability boundaries of foundation models in neuroimaging-based diagnosis and establishes a benchmark for future research.
📝 Abstract
Accurate differentiation of brain tumor types on magnetic resonance imaging (MRI) is critical for guiding treatment planning in neuro-oncology. Recent advances in large language models (LLMs) have enabled visual question answering (VQA) approaches that integrate image interpretation with natural language reasoning. In this study, we evaluated GPT-4o, GPT-5-nano, GPT-5-mini, and GPT-5 on a curated brain tumor VQA benchmark derived from three Brain Tumor Segmentation (BraTS) datasets: glioblastoma (GLI), meningioma (MEN), and brain metastases (MET). Each case included multi-sequence MRI triplanar mosaics and structured clinical features, which were transformed into standardized VQA items. Models were assessed in a zero-shot chain-of-thought setting for accuracy on both visual and reasoning tasks. GPT-5-mini achieved the highest macro-average accuracy (44.19%), followed by GPT-5 (43.71%), GPT-4o (41.49%), and GPT-5-nano (35.85%). Performance varied by tumor subtype, with no single model dominating across all cohorts. These findings suggest that GPT-5 family models can achieve moderate accuracy in structured neuro-oncological VQA tasks, but not at a level acceptable for clinical use.
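The headline metric is a macro-average accuracy. The abstract does not spell out how it is aggregated, so the following is only a minimal sketch under one common reading: accuracy is computed separately for each cohort (GLI, MEN, MET) and the per-cohort values are then averaged with equal weight. The function name `macro_average_accuracy`, the record layout, and the toy answers are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def macro_average_accuracy(records):
    """Return (macro_accuracy, per_cohort_accuracy).

    `records` is an iterable of (cohort, is_correct) pairs, one per VQA item.
    Assumption: "macro-average" means the unweighted mean of per-cohort
    accuracies; the paper may aggregate over a different grouping.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for cohort, is_correct in records:
        total[cohort] += 1
        correct[cohort] += int(is_correct)
    if not total:
        raise ValueError("no records supplied")
    per_cohort = {c: correct[c] / total[c] for c in total}
    macro = sum(per_cohort.values()) / len(per_cohort)
    return macro, per_cohort


# Toy, made-up outcomes (not results from the study).
records = [
    ("GLI", True), ("GLI", False), ("GLI", False), ("GLI", False),
    ("MEN", True), ("MEN", True),
    ("MET", False), ("MET", True),
]
macro, per_cohort = macro_average_accuracy(records)
micro = sum(ok for _, ok in records) / len(records)  # pooled accuracy, for contrast
print(f"macro={macro:.4f} micro={micro:.4f} per_cohort={per_cohort}")
```

With unequal cohort sizes the macro and pooled (micro) values differ, which is why a macro-average keeps the largest cohort from dominating the score.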