Large Language Model-Based Uncertainty-Adjusted Label Extraction for Artificial Intelligence Model Development in Upper Extremity Radiography

📅 2025-10-07
🤖 AI Summary
This study addresses the challenge of automatically extracting and structuring diagnostic labels, including uncertainty annotations, from unstructured radiology reports for multi-label classification of musculoskeletal radiographs. Method: The study evaluated GPT-4o's ability to parse free-text upper-extremity radiology reports and generate uncertainty-aware diagnostic labels, which were then used to train a ResNet50-based multi-label classifier. Performance was assessed using macro-averaged AUC and precision-recall curves, and external validation evaluated generalizability across institutions. Contribution/Results: Label extraction achieved 98.6% accuracy. Reassigning uncertain labels as present or absent did not significantly change classification performance (no statistically significant differences across labeling strategies). The models performed competitively across anatomical regions, with a macro-averaged AUC of 0.80 for the elbow, and generalized well to external data. This work describes a reproducible, LLM-driven approach to clinical text structuring and weakly supervised medical image modeling, and provides empirical evidence for integrating large language models into radiological workflow automation.

📝 Abstract
Objectives: To evaluate GPT-4o's ability to extract diagnostic labels (with uncertainty) from free-text radiology reports and to test how these labels affect multi-label image classification of musculoskeletal radiographs.

Methods: This retrospective study included radiography series of the clavicle (n=1,170), elbow (n=3,755), and thumb (n=1,978). After anonymization, GPT-4o filled out structured templates by indicating imaging findings as present ("true"), absent ("false"), or "uncertain." To assess the impact of label uncertainty, "uncertain" labels in the training and validation sets were automatically reassigned to "true" (inclusive) or "false" (exclusive). The resulting label-image pairs were used to train ResNet50 models for multi-label classification. Label extraction accuracy was manually verified on internal (clavicle: n=233, elbow: n=745, thumb: n=393) and external test sets (n=300 for each). Performance was assessed using macro-averaged receiver operating characteristic (ROC) area under the curve (AUC), precision-recall curves, sensitivity, specificity, and accuracy. AUCs were compared with the DeLong test.

Results: Automatic extraction was correct for 98.6% (60,618 of 61,488) of labels in the test sets. Across anatomic regions, label-based model training yielded competitive performance, measured by macro-averaged AUC, for inclusive models (e.g., elbow: AUC=0.80 [range, 0.62-0.87]) and exclusive models (elbow: AUC=0.80 [range, 0.61-0.88]). Models generalized well to external datasets (elbow [inclusive]: AUC=0.79 [range, 0.61-0.87]; elbow [exclusive]: AUC=0.79 [range, 0.63-0.89]). No significant differences were observed across labeling strategies or datasets (p ≥ 0.15).

Conclusion: GPT-4o extracted labels from radiologic reports with high accuracy, and these labels were sufficient to train competitive multi-label classification models. Uncertainty detected in the radiologic reports did not influence the performance of these models.
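The inclusive and exclusive reassignment of "uncertain" labels described in the Methods can be illustrated with a minimal Python sketch. The finding names and the `reassign_uncertain` helper are hypothetical and chosen only for illustration; the study's actual template fields and code are not given here.

```python
from typing import Dict

def reassign_uncertain(labels: Dict[str, str], strategy: str) -> Dict[str, int]:
    """Map 'true'/'false'/'uncertain' strings to binary training targets.

    strategy='inclusive' treats 'uncertain' as present (1);
    strategy='exclusive' treats 'uncertain' as absent (0).
    """
    if strategy not in ("inclusive", "exclusive"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    uncertain_value = 1 if strategy == "inclusive" else 0
    mapping = {"true": 1, "false": 0, "uncertain": uncertain_value}
    return {finding: mapping[value] for finding, value in labels.items()}

# Hypothetical extracted labels for one report (finding names are assumptions).
extracted = {
    "fracture": "true",
    "joint_effusion": "uncertain",
    "dislocation": "false",
}

inclusive = reassign_uncertain(extracted, "inclusive")
exclusive = reassign_uncertain(extracted, "exclusive")
print(inclusive)
print(exclusive)
```

Under this sketch, the two strategies differ only in the targets for findings marked "uncertain", so any performance gap between the resulting models would be attributable to those labels.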
Problem

Research questions and friction points this paper is trying to address.

Extracting diagnostic labels with uncertainty from radiology reports
Training multi-label classification models using extracted labels
Evaluating impact of label uncertainty on model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPT-4o extracts diagnostic labels with uncertainty
Uncertain labels reassigned to "true" (inclusive) or "false" (exclusive) before multi-label classification training
ResNet50 models trained on the extracted labels achieve competitive macro-averaged AUC (e.g., elbow: 0.80)
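As a rough illustration of the headline metric, the sketch below computes a macro-averaged ROC AUC: one AUC per label, then an unweighted mean across labels. It uses the pairwise-ranking definition of AUC in plain Python; the function names and toy data are assumptions for illustration and are not the study's evaluation code.

```python
from typing import List, Sequence

def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the probability that a positive case outscores a negative one
    (ties count as half a win)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined unless both classes are present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def macro_auc(y_true_by_label: List[Sequence[int]],
              y_score_by_label: List[Sequence[float]]) -> float:
    """Unweighted mean of the per-label AUCs."""
    aucs = [roc_auc(t, s) for t, s in zip(y_true_by_label, y_score_by_label)]
    return sum(aucs) / len(aucs)

# Toy data for two hypothetical labels, four images each.
truths = [[1, 0, 1, 0], [1, 0, 0, 1]]
scores = [[0.9, 0.1, 0.8, 0.3], [0.4, 0.6, 0.2, 0.7]]

per_label = [roc_auc(t, s) for t, s in zip(truths, scores)]
macro = macro_auc(truths, scores)
print(per_label, macro)
```

Because the mean is unweighted, a rare finding contributes as much to the macro average as a common one, which is why the abstract also reports the range of per-label AUCs.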
Hanna Kreutzer
Lab for Artificial Intelligence in Medicine, Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen, Germany
Anne-Sophie Caselitz
Lab for Artificial Intelligence in Medicine, Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen, Germany
Thomas Dratsch
Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, Cologne, Germany
Daniel Pinto dos Santos
Department of Diagnostic and Interventional Radiology, University Medical Center Mainz, Mainz, Germany
Christiane Kuhl
Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen, Germany
Daniel Truhn
Professor of Radiology, University Hospital Aachen
Machine Learning, Artificial Intelligence, Computer Vision, Medical Imaging
Sven Nebelung
Department of Diagnostic and Interventional Radiology, University Hospital Aachen
Advanced MRI Techniques, Functionality Assessment, Biomechanical Imaging, Cartilage, Artificial Intelligence