Benchmarking GPT-5 in Radiation Oncology: Measurable Gains, but Persistent Need for Expert Oversight

📅 2025-08-29
🤖 AI Summary
Problem: The performance of GPT-5 for clinical decision support in radiation oncology had not been systematically characterized. Method: GPT-5 was assessed on two clinically grounded benchmarks: the American College of Radiology (ACR) Radiation Oncology In-Training Examination (TXIT, 2021; 300 multiple-choice items) and 60 real-world clinical vignettes rated by four board-certified radiation oncologists. Correctness, comprehensiveness, and hallucinations were rated; inter-rater reliability was assessed via Fleiss' kappa. Contribution/Results: GPT-5 achieved 92.8% mean accuracy on the multiple-choice benchmark, ahead of GPT-4 (78.8%) and GPT-3.5 (62.1%), and scored 3.24/4 on recommendation correctness and 3.59/4 on comprehensiveness. Hallucinations were rare, with no case reaching majority consensus among raters, but substantive errors did occur, clustering in complex scenarios that require precise trial knowledge or detailed clinical adaptation. GPT-5 shows real potential as a decision-support tool but cannot replace expert oversight: its recommendations require rigorous expert review before clinical implementation. The study provides empirical evidence to guide safe, responsible use of LLMs in clinical radiotherapy practice.

📝 Abstract
Introduction: Large language models (LLMs) have shown great potential in clinical decision support. GPT-5 is a novel LLM system that has been specifically marketed towards oncology use. Methods: Performance was assessed using two complementary benchmarks: (i) the ACR Radiation Oncology In-Training Examination (TXIT, 2021), comprising 300 multiple-choice items, and (ii) a curated set of 60 authentic radiation oncology vignettes representing diverse disease sites and treatment indications. For the vignette evaluation, GPT-5 was instructed to generate concise therapeutic plans. Four board-certified radiation oncologists rated correctness, comprehensiveness, and hallucinations. Inter-rater reliability was quantified using Fleiss' κ. Results: On the TXIT benchmark, GPT-5 achieved a mean accuracy of 92.8%, outperforming GPT-4 (78.8%) and GPT-3.5 (62.1%). Domain-specific gains were most pronounced in Dose and Diagnosis. In the vignette evaluation, GPT-5's treatment recommendations were rated highly for correctness (mean 3.24/4, 95% CI: 3.11-3.38) and comprehensiveness (3.59/4, 95% CI: 3.49-3.69). Hallucinations were rare, with no case reaching majority consensus for their presence. Inter-rater agreement was low (Fleiss' κ 0.083 for correctness), reflecting inherent variability in clinical judgment. Errors clustered in complex scenarios requiring precise trial knowledge or detailed clinical adaptation. Discussion: GPT-5 clearly outperformed prior model variants on the radiation oncology multiple-choice benchmark. Although GPT-5 exhibited favorable performance in generating real-world radiation oncology treatment recommendations, correctness ratings indicate room for further improvement. While hallucinations were infrequent, the presence of substantive errors underscores that GPT-5-generated recommendations require rigorous expert oversight before clinical implementation.
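The abstract reports inter-rater reliability as Fleiss' κ without showing how the statistic is computed. The following is a minimal sketch of the standard Fleiss' κ formula, not the authors' analysis code; the function name `fleiss_kappa` and the toy count matrices are hypothetical, and the input is assumed to be a table where each row is one vignette and each column counts how many raters chose that rating category.

```python
import math
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j (hypothetical helper)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each item: fraction of rater pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance

    if math.isclose(p_e, 1.0):
        # All ratings fall in one category; kappa is undefined here.
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


# Toy data (not from the study): 4 raters, 2 categories.
perfect = [[4, 0], [0, 4], [4, 0], [0, 4]]   # raters always agree
split = [[2, 2], [2, 2]]                     # raters split evenly
kappa_perfect = fleiss_kappa(perfect)
kappa_split = fleiss_kappa(split)
```

A value near 0, such as the 0.083 reported for correctness, means the observed agreement is only slightly above what chance alone would produce for the given rating distribution.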
Problem

Research questions and friction points this paper is trying to address.

Evaluating GPT-5's performance in radiation oncology decision-making
Assessing clinical correctness and hallucinations in treatment recommendations
Determining need for expert oversight in AI-generated oncology plans
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPT-5 evaluated using multiple-choice oncology exam
Real-world vignettes assessed by board-certified oncologists
Performance measured via correctness and hallucination metrics
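The correctness and hallucination metrics listed above can be aggregated along the following lines. This is a rough sketch under stated assumptions, not the study's procedure: the source does not say how the 95% confidence intervals were derived, so a normal approximation is assumed here, and "majority consensus" is assumed to mean strictly more than half of the raters. The helper names `mean_ci95` and `majority_flagged` and the example ratings are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_ci95(values: Sequence[float]) -> Tuple[float, float, float]:
    """Mean with a normal-approximation 95% confidence interval
    (assumed method; the source does not specify one)."""
    m = mean(values)
    half_width = 1.96 * stdev(values) / math.sqrt(len(values))
    return m, m - half_width, m + half_width


def majority_flagged(flags: Sequence[bool]) -> bool:
    """True when strictly more than half of the raters flagged a
    hallucination (assumed definition of majority consensus)."""
    return sum(flags) > len(flags) / 2


# Illustrative ratings on a 1-4 scale (not data from the study).
example_ratings = [3, 3, 4, 4]
m, lower, upper = mean_ci95(example_ratings)

# With four raters, two flags are a tie, three are a majority.
tie = majority_flagged([True, True, False, False])
majority = majority_flagged([True, True, True, False])
```

Under this definition, a vignette flagged by two of four raters does not count as a consensus hallucination, which fits the reported result that no case reached majority consensus even though individual raters may have flagged some outputs.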
Authors

Ugur Dinc
Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Jibak Sarkar
Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Philipp Schubert
Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Sabine Semrau
Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Thomas Weissmann
Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Andre Karius
Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Johann Brand
Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Bernd-Niklas Axer
Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Ahmed Gomaa
PhD Candidate
medical imaging, radiation oncology, deep learning, survival analysis
Pluvio Stephan
Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Ishita Sheth
Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Sogand Beirami
Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Annette Schwarz
Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Udo Gaipl
Professor for Radiation Immunobiology, Universitätsklinikum Erlangen
Immune modulation by radiation, radioimmunotherapies
Benjamin Frey
Department for Radiation Oncology, Universitätsklinikum Erlangen
Radiation Oncology
Christoph Bert
Professor für Medizinische Strahlenphysik, FAU Erlangen-Nürnberg
radiation oncology, medical physics
Stefanie Corradini
Department of Radiation Oncology, University Hospital, Ludwig Maximilian University of Munich, Munich, Germany
Rainer Fietkau
Universitätsklinikum Erlangen, Department of Radiation Oncology
Radiation Oncology
Florian Putz
Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany