Evaluating Large Language Models for Zero-Shot Disease Labeling in CT Radiology Reports Across Organ Systems

📅 2025-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study evaluates how well lightweight open-weight large language models (LLMs) label diseases in multi-organ CT radiology reports (chest, abdomen, pelvis) using zero-shot prompting, without fine-tuning. The LLMs are compared against a rule-based algorithm (RBA) and RadBERT, with performance measured by Cohen's Kappa and micro/macro-averaged F1 scores and externally validated on the CT-RATE dataset. On the manually annotated set, Gemma-3 27B achieves the highest macro-F1 (0.82), followed by Llama-3.1 8B (0.79), while the RBA scores lowest (0.64). On CT-RATE (lungs/pleura only), Llama-3.1 8B performs best (0.91), with Gemma-3 27B close behind (0.89). Llama-3.1 8B and Gemma-3 27B also show the highest agreement on 12,197 CAP reports (median Kappa 0.87). The results indicate that lightweight LLMs outperform rule-based labeling and generalize across organ systems, although binary labels alone cannot capture the full nuance of report language.

📝 Abstract

Purpose: This study aims to evaluate the effectiveness of large language models (LLMs) in automating disease annotation of CT radiology reports. We compare a rule-based algorithm (RBA), RadBERT, and three lightweight open-weight LLMs for multi-disease labeling of chest, abdomen, and pelvis (CAP) CT reports. Materials and Methods: This retrospective study analyzed 40,833 CT reports from 29,540 patients, with 1,789 CAP reports manually annotated across three organ systems. External validation was conducted using the CT-RATE dataset. Three open-weight LLMs were tested with zero-shot prompting. Performance was evaluated using Cohen's Kappa and micro/macro-averaged F1 scores. Results: In 12,197 Duke CAP reports from 8,854 patients, Llama-3.1 8B and Gemma-3 27B showed the highest agreement ($\kappa$ median: 0.87). On the manually annotated set, Gemma-3 27B achieved the top macro-F1 (0.82), followed by Llama-3.1 8B (0.79), while the RBA scored lowest (0.64). On the CT-RATE dataset (lungs/pleura only), Llama-3.1 8B performed best (0.91), with Gemma-3 27B close behind (0.89). Performance differences were mainly due to differing labeling practices, especially for lung atelectasis. Conclusion: Lightweight LLMs outperform rule-based methods for CT report annotation and generalize across organ systems with zero-shot prompting. However, binary labels alone cannot capture the full nuance of report language. LLMs can provide a flexible, efficient solution aligned with clinical judgment and user needs.
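
The evaluation metrics named in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' evaluation code: the function names (`cohen_kappa`, `micro_macro_f1`) and the toy labels are assumptions made for illustration, and the sketch assumes each report is represented as a row of 0/1 labels, one per disease. In practice, scikit-learn's `cohen_kappa_score` and `f1_score` cover the same calculations.

```python
from typing import List, Sequence, Tuple


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two binary (0/1) label sequences of equal length."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa1, pb1 = sum(a) / n, sum(b) / n
    # Agreement expected by chance from each labeler's positive rate.
    expected = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    if expected == 1.0:
        # Kappa is undefined when both labelers are constant and identical;
        # returning 1.0 here is a choice made for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


def _f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    y_true: List[Sequence[int]], y_pred: List[Sequence[int]]
) -> Tuple[float, float]:
    """Return (micro_f1, macro_f1) for multi-label 0/1 rows, one column per disease."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # [tp, fp, fn] per disease
    for t_row, p_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    # Macro: average the per-disease F1 scores, so rare diseases weigh equally.
    macro = sum(_f1(*c) for c in counts) / n_labels
    # Micro: pool counts over all diseases before computing F1.
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return _f1(tp, fp, fn), macro
```

Macro-averaging gives every disease the same weight regardless of prevalence, which is why it is often the headline number when label frequencies differ widely across organ systems.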

Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs for zero-shot disease labeling in CT reports

Compare rule-based and LLM methods for multi-disease annotation

Assess generalization across organ systems with zero-shot prompting
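
A zero-shot labeling setup of the kind described above can be sketched as a prompt builder plus a response parser. The prompt wording, the disease names, and the helper names `build_prompt` and `parse_labels` are all hypothetical; the paper's actual prompts and output format are not given in this summary. The call to the model itself is deliberately left out.

```python
import json
from typing import Dict, List


def build_prompt(report_text: str, diseases: List[str]) -> str:
    """Assemble a zero-shot prompt: instructions and report only, no labeled examples."""
    disease_list = "\n".join(f"- {d}" for d in diseases)
    return (
        "You are labeling a CT radiology report.\n"
        "For each finding below, answer 1 if the report states it is present, "
        "otherwise 0.\n"
        f"Findings:\n{disease_list}\n\n"
        f"Report:\n{report_text}\n\n"
        "Respond with a single JSON object mapping each finding to 0 or 1."
    )


def parse_labels(response: str, diseases: List[str]) -> Dict[str, int]:
    """Extract 0/1 labels from a model response, tolerating text around the JSON."""
    negative = {d: 0 for d in diseases}
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end <= start:
        return negative  # assumption: unparseable output counts as all-negative
    try:
        raw = json.loads(response[start : end + 1])
    except json.JSONDecodeError:
        return negative
    return {d: 1 if raw.get(d) in (1, "1", True) else 0 for d in diseases}
```

Treating an unparseable response as all-negative is only one option; a real pipeline might instead retry the request or flag the report for manual review.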

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight LLMs for zero-shot disease labeling

Comparison with rule-based and RadBERT methods

Evaluation using Cohen's Kappa and F1 scores
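
For contrast with the LLM approach, a rule-based labeler can be sketched as keyword matching with a simple negation check. This is a deliberately naive illustration, not the RBA evaluated in the paper, whose rules are not described in this summary; the keyword dictionary, the negation cues, and the name `rule_based_label` are assumptions.

```python
import re
from typing import Dict, List

# Hypothetical negation cues; a production rule set would be far larger.
NEGATION = re.compile(r"\b(?:no|not|without|negative for)\b")


def rule_based_label(report_text: str, keywords: Dict[str, List[str]]) -> Dict[str, int]:
    """Label a disease 1 if any of its keywords appears un-negated in a sentence."""
    labels = {name: 0 for name in keywords}
    # Split on sentence-ending punctuation so negation stays local to a sentence.
    for sentence in re.split(r"(?<=[.!?])\s+", report_text.lower()):
        for name, terms in keywords.items():
            for term in terms:
                idx = sentence.find(term)
                if idx == -1:
                    continue
                # Positive only if no negation cue precedes the keyword.
                if not NEGATION.search(sentence[:idx]):
                    labels[name] = 1
    return labels
```

Rules like these are brittle against hedged or indirect phrasing (for example, "cannot exclude effusion").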
