Evaluating Large Language Models for Zero-Shot Disease Labeling in CT Radiology Reports Across Organ Systems

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the efficacy and clinical consistency of lightweight open-weight large language models (LLMs) for zero-shot disease annotation of multi-organ CT radiology reports covering the chest, abdomen, and pelvis. Using zero-shot prompting, the authors evaluate the cross-organ generalization of Llama-3.1 8B and Gemma-3 27B, validating performance with Cohen's Kappa, micro/macro-averaged F1 scores, and the external CT-RATE benchmark. Gemma-3 27B achieves the highest macro-F1 (0.82) on the manually annotated set, Llama-3.1 8B attains an F1 of 0.91 on CT-RATE, and the two models agree with each other at a median Kappa of 0.87 across 12,197 Duke reports; both clearly surpass the rule-based baseline (macro-F1 = 0.64). To the authors' knowledge, this is the first systematic demonstration that lightweight LLMs can deliver robust, clinically consistent cross-organ labeling of CT reports without fine-tuning, pointing toward a practical route to automated annotation of clinical radiology reports.
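As a point of reference for the metrics named above, here is a minimal sketch of how per-disease Cohen's Kappa and micro/macro-averaged F1 can be computed with scikit-learn; the label matrices are illustrative placeholders, not data from the study.

```python
# Illustrative only: Cohen's Kappa and micro/macro-averaged F1 on hypothetical
# multilabel disease annotations (rows = reports, columns = diseases).
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score

y_true = np.array([[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0]])  # reference labels
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 0]])  # LLM-derived labels

micro_f1 = f1_score(y_true, y_pred, average="micro")  # pooled over all disease labels
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-disease F1

# Cohen's Kappa is a chance-corrected agreement score per disease; summarize by median
kappas = [cohen_kappa_score(y_true[:, j], y_pred[:, j]) for j in range(y_true.shape[1])]

print(f"micro-F1={micro_f1:.2f}  macro-F1={macro_f1:.2f}  median kappa={np.median(kappas):.2f}")
```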

📝 Abstract
Purpose: This study aims to evaluate the effectiveness of large language models (LLMs) in automating disease annotation of CT radiology reports. We compare a rule-based algorithm (RBA), RadBERT, and three lightweight open-weight LLMs for multi-disease labeling of chest, abdomen, and pelvis (CAP) CT reports.
Materials and Methods: This retrospective study analyzed 40,833 CT reports from 29,540 patients, with 1,789 CAP reports manually annotated across three organ systems. External validation was conducted using the CT-RATE dataset. Three open-weight LLMs were tested with zero-shot prompting. Performance was evaluated using Cohen's Kappa and micro/macro-averaged F1 scores.
Results: In 12,197 Duke CAP reports from 8,854 patients, Llama-3.1 8B and Gemma-3 27B showed the highest agreement (median κ: 0.87). On the manually annotated set, Gemma-3 27B achieved the top macro-F1 (0.82), followed by Llama-3.1 8B (0.79), while the RBA scored lowest (0.64). On the CT-RATE dataset (lungs/pleura only), Llama-3.1 8B performed best (0.91), with Gemma-3 27B close behind (0.89). Performance differences were mainly due to differing labeling practices, especially for lung atelectasis.
Conclusion: Lightweight LLMs outperform rule-based methods for CT report annotation and generalize across organ systems with zero-shot prompting. However, binary labels alone cannot capture the full nuance of report language. LLMs can provide a flexible, efficient solution aligned with clinical judgment and user needs.
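The zero-shot prompting setup described in Materials and Methods can be sketched roughly as follows; the model name, disease list, and prompt wording below are illustrative assumptions, not the study's actual prompts or pipeline.

```python
# Minimal sketch (not the authors' pipeline): zero-shot binary disease labeling of a
# CT report with an open-weight LLM via Hugging Face transformers.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

DISEASES = ["atelectasis", "pleural effusion", "pneumothorax"]  # illustrative subset

def label_report(findings: str) -> dict:
    """Return a binary label per disease by asking one yes/no question each."""
    labels = {}
    for disease in DISEASES:
        prompt = (
            "You are annotating a CT radiology report.\n"
            f"Findings: {findings}\n"
            f"Question: Is {disease} present? Answer with a single word, yes or no.\n"
            "Answer:"
        )
        output = generator(prompt, max_new_tokens=3, do_sample=False)[0]["generated_text"]
        answer = output[len(prompt):].strip().lower()
        labels[disease] = 1 if answer.startswith("yes") else 0
    return labels

print(label_report("Bibasilar atelectasis. No pleural effusion or pneumothorax."))
```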
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs for zero-shot disease labeling in CT reports
Compare rule-based and LLM methods for multi-disease annotation (a toy rule-based labeler is sketched after this list)
Assess generalization across organ systems with zero-shot prompting
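For contrast with the LLM approach, the sketch below shows a toy keyword-and-negation rule-based labeler of the kind such baselines typically use; it is a generic illustration, not the specific RBA evaluated in the paper.

```python
import re

# Toy rule-based labeler: keyword match plus a crude per-sentence negation check.
KEYWORDS = {
    "atelectasis": ["atelectasis", "atelectatic"],
    "pleural effusion": ["pleural effusion", "effusion"],
}
NEGATIONS = ["no ", "without ", "negative for "]

def rule_based_labels(report: str) -> dict:
    labels = {disease: 0 for disease in KEYWORDS}
    for sentence in re.split(r"[.;]\s*", report.lower()):
        negated = any(neg in sentence for neg in NEGATIONS)
        for disease, terms in KEYWORDS.items():
            if any(term in sentence for term in terms) and not negated:
                labels[disease] = 1
    return labels

print(rule_based_labels("Mild bibasilar atelectasis. No pleural effusion."))
```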
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight LLMs for zero-shot disease labeling
Comparison with rule-based and RadBERT methods
Evaluation using Cohen's Kappa and F1 scores
Authors
Michael E. Garcia-Alcoser
Center for Virtual Imaging Trials, Carl E. Ravin Advanced Imaging Laboratories, Department of Radiology, Duke University School of Medicine, Durham, NC, USA.
Mobina Ghojoghnejad
Center for Virtual Imaging Trials, Carl E. Ravin Advanced Imaging Laboratories, Department of Radiology, Duke University School of Medicine, Durham, NC, USA.
F. I. Tushar
Center for Virtual Imaging Trials, Carl E. Ravin Advanced Imaging Laboratories, Department of Radiology, Duke University School of Medicine, Durham, NC, USA.
David Kim
Center for Virtual Imaging Trials, Carl E. Ravin Advanced Imaging Laboratories, Department of Radiology, Duke University School of Medicine, Durham, NC, USA.
Kyle J. Lafata
Thaddeus V. Samulski Associate Professor, Duke University. Research interests: computational oncology, mathematical oncology, applied mathematics, imaging, radiation biology.
Geoff D. Rubin
Center for Virtual Imaging Trials, Carl E. Ravin Advanced Imaging Laboratories, Department of Radiology, Duke University School of Medicine, Durham, NC, USA.
Joseph Y. Lo
Professor of Radiology, Biomedical Engineering, Electrical Engineering, Medical Physics, Duke University. Research interests: medical imaging, machine learning.