🤖 AI Summary
Manual annotation of radiology reports for pancreatic cystic lesions (PCLs) is time-intensive and hinders large-scale clinical research.
Method: We propose a large language model (LLM)-based framework for automated PCL information extraction and risk stratification, combining chain-of-thought (CoT) prompting with lightweight fine-tuning. High-quality, reasoning-augmented training data were synthesized with GPT-4o; open-source LLMs (e.g., LLaMA, DeepSeek) were fine-tuned via QLoRA; and extracted features were mapped to risk categories per an institutional guideline based on the 2017 ACR White Paper.
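To make the data-synthesis step concrete, here is a minimal sketch of CoT labeling with GPT-4o via the OpenAI Python client. The prompt wording, feature schema, and function name are illustrative assumptions, not the paper's actual prompts.

```python
# Hypothetical sketch: synthesizing CoT-annotated labels with GPT-4o.
# The system prompt and JSON feature schema below are illustrative
# assumptions; the paper's real prompts are not reproduced here.
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = """You are a radiology information-extraction assistant.
Reason step by step about the report, then output a JSON object with:
cyst_size_mm, location, main_pancreatic_duct_mm, mural_nodule (bool),
septations (bool), and your step-by-step reasoning under "rationale"."""

def label_report(report_text: str) -> dict:
    """Ask GPT-4o for a chain-of-thought rationale plus structured features."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": report_text},
        ],
    )
    return json.loads(response.choices[0].message.content)
```

Keeping the rationale alongside the structured output is what lets the fine-tuned student models inherit the reasoning behavior, not just the final labels.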
Contribution/Results: Evaluated on 285 held-out MRI/CT reports, the method achieves 97–98% accuracy in extracting key morphological features and a macro-averaged F1 score of up to 0.95 for risk classification. Agreement between the models and expert radiologists is statistically on par with inter-radiologist agreement. To our knowledge, this is the first fully automated, interpretable, and clinically aligned PCL structuring approach that eliminates manual annotation, improving scalability for research and accelerating clinical translation.
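The three evaluation metrics used here are all standard; the sketch below shows how they can be computed with scikit-learn and statsmodels. The toy labels and the simplified per-report exact-match check are illustrative assumptions, not the paper's evaluation harness.

```python
# Hypothetical sketch: exact-match accuracy, macro-F1, and Fleiss' kappa
# with standard libraries. All data below are toy values for illustration.
from sklearn.metrics import f1_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Exact-match accuracy over extracted feature dicts (one dict per report).
def exact_match_accuracy(preds: list[dict], golds: list[dict]) -> float:
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# Macro-averaged F1 over risk categories (e.g., 0=low, 1=intermediate, 2=high).
y_true = [0, 1, 2, 1, 0, 2]
y_pred = [0, 1, 2, 2, 0, 2]
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Fleiss' kappa over raters (rows = cases, columns = one label per rater);
# e.g., three radiologists plus one model, as in the reader study.
ratings = [[0, 0, 0, 0],
           [1, 1, 2, 1],
           [2, 2, 2, 2]]
table, _ = aggregate_raters(ratings)          # category counts per case
kappa = fleiss_kappa(table, method="fleiss")
print(f"macro-F1={macro_f1:.2f}, Fleiss' kappa={kappa:.2f}")
```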
📝 Abstract
Background: Manual extraction of pancreatic cystic lesion (PCL) features from radiology reports is labor-intensive, limiting the large-scale studies needed to advance PCL research.

Purpose: To develop and evaluate large language models (LLMs) that automatically extract PCL features from MRI/CT reports and assign risk categories based on guidelines.

Materials and Methods: We curated a training dataset of 6,000 abdominal MRI/CT reports (2005–2024) from 5,134 patients that described PCLs. Labels were generated by GPT-4o using chain-of-thought (CoT) prompting to extract PCL and main pancreatic duct features. Two open-source LLMs were fine-tuned using QLoRA on the GPT-4o-generated CoT data. Features were mapped to risk categories per an institutional guideline based on the 2017 ACR White Paper. Evaluation was performed on 285 held-out, human-annotated reports, and model outputs for 100 cases were independently reviewed by three radiologists. Feature extraction was evaluated with exact match accuracy, risk categorization with macro-averaged F1 score, and radiologist-model agreement with Fleiss' Kappa.

Results: CoT fine-tuning improved feature extraction accuracy for LLaMA (80% to 97%) and DeepSeek (79% to 98%), matching GPT-4o (97%). Risk categorization F1 scores also improved (LLaMA: 0.95; DeepSeek: 0.94), closely matching GPT-4o (0.97), with no statistically significant differences. Radiologist inter-reader agreement was high (Fleiss' Kappa = 0.888) and showed no statistically significant change with the addition of DeepSeek-FT-CoT (Fleiss' Kappa = 0.893) or GPT-CoT (Fleiss' Kappa = 0.897), indicating that both models achieved agreement on par with radiologists.

Conclusion: Fine-tuned open-source LLMs with CoT supervision enable accurate, interpretable, and efficient phenotyping for large-scale PCL research, achieving performance comparable to GPT-4o.
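For readers who want to see how the fine-tuning step in Materials and Methods is typically wired together, here is a minimal QLoRA sketch using the Hugging Face stack (transformers, peft, bitsandbytes). The base model name, LoRA rank, and target modules are illustrative assumptions rather than the paper's settings.

```python
# Hypothetical sketch of a QLoRA fine-tuning setup. Model choice and
# hyperparameters are illustrative assumptions, not the paper's config.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "meta-llama/Meta-Llama-3-8B-Instruct"  # stand-in open-source LLM

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters are the only trainable parameters (the "LoRA" part).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Training then proceeds as usual (e.g., a supervised fine-tuning loop over
# (report, CoT rationale + structured output) pairs).
```

Quantizing the frozen base weights to 4-bit NF4 while training only the low-rank adapters is what makes fine-tuning 8B-class models on a modest GPU budget practical, which matters for the paper's goal of cheap, scalable phenotyping.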