🤖 AI Summary
This study addresses the challenge of automatically extracting and structuring diagnostic labels—including uncertainty annotations—from unstructured radiology reports for musculoskeletal X-ray multi-label classification.
Method: We systematically evaluated GPT-4o’s capability to parse free-text upper-limb radiology reports and generate uncertainty-aware diagnostic labels. These labels were integrated with a ResNet50-based multi-label classifier. Performance was assessed using macro-average AUC and precision–recall curves, and external validation evaluated generalizability across institutions.
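The macro-averaged AUC used for evaluation averages per-finding ROC AUCs across all labels. A minimal pure-Python sketch using the rank-based (Mann–Whitney) formulation; function names and the data layout (rows = images, columns = findings) are illustrative, not taken from the study:

```python
def auc(scores, labels):
    """Rank-based ROC AUC for one binary finding (Mann-Whitney U statistic)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return float("nan")  # AUC undefined without both classes
    # Count pairs where a positive outscores a negative (ties count half).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(score_matrix, label_matrix):
    """Macro average: compute AUC per label column, then take the mean."""
    n_labels = len(label_matrix[0])
    per_label = [
        auc([row[j] for row in score_matrix],
            [row[j] for row in label_matrix])
        for j in range(n_labels)
    ]
    return sum(per_label) / len(per_label), per_label
```

Macro averaging weights every finding equally regardless of prevalence, which is why rare findings can pull the reported ranges (e.g., 0.62-0.87) well below the mean.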
Contribution/Results: Label extraction achieved 98.6% accuracy. Incorporating uncertainty annotations did not significantly affect classification performance (no statistically significant differences across labeling strategies). The model demonstrated robust performance across anatomical regions, achieving a maximum macro-average AUC of 0.80, and exhibited strong cross-center generalization. This work establishes a reproducible, LLM-driven paradigm for clinical text structuring and weakly supervised medical image modeling, providing empirical evidence for integrating large language models into radiological workflow automation.
📝 Abstract
Objectives: To evaluate GPT-4o's ability to extract diagnostic labels (with uncertainty) from free-text radiology reports and to test how these labels affect multi-label image classification of musculoskeletal radiographs.

Methods: This retrospective study included radiography series of the clavicle (n=1,170), elbow (n=3,755), and thumb (n=1,978). After anonymization, GPT-4o filled out structured templates by indicating imaging findings as present ("true"), absent ("false"), or "uncertain." To assess the impact of label uncertainty, "uncertain" labels in the training and validation sets were automatically reassigned to "true" (inclusive) or "false" (exclusive). Label–image pairs were used for multi-label classification with ResNet50. Label extraction accuracy was manually verified on internal (clavicle: n=233; elbow: n=745; thumb: n=393) and external test sets (n=300 each). Performance was assessed using macro-averaged receiver operating characteristic (ROC) area under the curve (AUC), precision–recall curves, sensitivity, specificity, and accuracy. AUCs were compared with the DeLong test.

Results: Automatic extraction was correct for 98.6% (60,618 of 61,488) of labels in the test sets. Across anatomic regions, label-based model training yielded competitive macro-averaged AUCs for both inclusive (e.g., elbow: AUC=0.80 [range, 0.62-0.87]) and exclusive models (elbow: AUC=0.80 [range, 0.61-0.88]). Models generalized well to external datasets (elbow [inclusive]: AUC=0.79 [range, 0.61-0.87]; elbow [exclusive]: AUC=0.79 [range, 0.63-0.89]). No significant differences were observed across labeling strategies or datasets (p>=0.15).

Conclusion: GPT-4o extracted labels from radiology reports with high accuracy, enabling the training of competitive multi-label classification models. Detected uncertainty in the reports did not influence the performance of these models.
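The inclusive/exclusive reassignment of "uncertain" labels can be sketched as follows. This is a minimal illustration of the mapping described in the abstract; the function name, dictionary layout, and finding names are assumptions, not the study's actual pipeline:

```python
def resolve_labels(report_labels, strategy="inclusive"):
    """Map GPT-4o template outputs to binary training targets.

    report_labels: dict mapping finding name -> "true" | "false" | "uncertain".
    strategy: "inclusive" resolves "uncertain" to positive (1),
              "exclusive" resolves it to negative (0).
    """
    uncertain_value = 1 if strategy == "inclusive" else 0
    mapping = {"true": 1, "false": 0, "uncertain": uncertain_value}
    return {finding: mapping[value] for finding, value in report_labels.items()}
```

Applied to a report labeled `{"fracture": "uncertain", "joint_effusion": "false"}`, the inclusive strategy yields a positive fracture target while the exclusive strategy yields a negative one; the study found no significant performance difference between the two.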