Leveraging Fine-Tuned Large Language Models for Interpretable Pancreatic Cystic Lesion Feature Extraction and Risk Categorization

📅 2025-07-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Manual annotation of radiology reports for pancreatic cystic lesions (PCLs) is time-intensive and hinders large-scale clinical research. Method: We propose a large language model (LLM)-based framework for automated PCL feature extraction and risk stratification, combining chain-of-thought prompting with lightweight fine-tuning. High-quality, reasoning-augmented training data were synthesized with GPT-4o; open-source LLMs (LLaMA and DeepSeek) were fine-tuned via QLoRA; and extracted features were mapped to risk categories per an institutional guideline based on the 2017 ACR White Paper. Contribution/Results: Evaluated on 285 held-out MRI/CT reports, the method achieves 97–98% accuracy in extracting key morphological features and a macro F1 score of 0.95 for risk classification, with model–radiologist agreement statistically indistinguishable from inter-radiologist agreement. To our knowledge, this is the first fully automated, interpretable, and clinically aligned approach to structuring PCL reports that eliminates manual annotation, enhancing scalability for research and accelerating clinical translation.

📝 Abstract
Background: Manual extraction of pancreatic cystic lesion (PCL) features from radiology reports is labor-intensive, limiting large-scale studies needed to advance PCL research. Purpose: To develop and evaluate large language models (LLMs) that automatically extract PCL features from MRI/CT reports and assign risk categories based on guidelines. Materials and Methods: We curated a training dataset of 6,000 abdominal MRI/CT reports (2005-2024) from 5,134 patients that described PCLs. Labels were generated by GPT-4o using chain-of-thought (CoT) prompting to extract PCL and main pancreatic duct features. Two open-source LLMs were fine-tuned using QLoRA on GPT-4o-generated CoT data. Features were mapped to risk categories per institutional guideline based on the 2017 ACR White Paper. Evaluation was performed on 285 held-out human-annotated reports. Model outputs for 100 cases were independently reviewed by three radiologists. Feature extraction was evaluated using exact match accuracy, risk categorization with macro-averaged F1 score, and radiologist-model agreement with Fleiss' Kappa. Results: CoT fine-tuning improved feature extraction accuracy for LLaMA (80% to 97%) and DeepSeek (79% to 98%), matching GPT-4o (97%). Risk categorization F1 scores also improved (LLaMA: 0.95; DeepSeek: 0.94), closely matching GPT-4o (0.97), with no statistically significant differences. Radiologist inter-reader agreement was high (Fleiss' Kappa = 0.888) and showed no statistically significant difference with the addition of DeepSeek-FT-CoT (Fleiss' Kappa = 0.893) or GPT-CoT (Fleiss' Kappa = 0.897), indicating that both models achieved agreement levels on par with radiologists. Conclusion: Fine-tuned open-source LLMs with CoT supervision enable accurate, interpretable, and efficient phenotyping for large-scale PCL research, achieving performance comparable to GPT-4o.
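The abstract reports radiologist–model agreement using Fleiss' Kappa. As a rough illustration of that statistic, here is a minimal NumPy sketch; the count matrix, rater count, and two-category setup below are hypothetical, not data from the paper:

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for an (n_subjects, n_categories) matrix where
    counts[i, j] = number of raters assigning subject i to category j.
    Assumes every subject is rated by the same number of raters."""
    counts = np.asarray(counts, dtype=float)
    n_raters = counts[0].sum()
    # Per-subject observed agreement among rater pairs.
    P_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    # Chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / counts.sum()
    P_e = np.square(p_j).sum()
    return (P_i.mean() - P_e) / (1 - P_e)

# Hypothetical example: 4 reports, 3 raters, two risk categories (low, high).
ratings = [[3, 0], [0, 3], [2, 1], [3, 0]]
print(round(fleiss_kappa(ratings), 3))  # prints 0.625
```

Values near 0.89, as reported in the abstract, indicate almost-perfect agreement on the conventional Landis–Koch scale; 1.0 means all raters agree on every subject.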
Problem

Research questions and friction points this paper is trying to address.

Automate PCL feature extraction from radiology reports
Improve risk categorization using fine-tuned LLMs
Achieve radiologist-level agreement in PCL analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned LLMs for pancreatic cyst feature extraction
Chain-of-thought prompting enhances model interpretability
QLoRA optimizes open-source model performance efficiently
Authors

Ebrahim Rasromani, Center for Data Science, New York University, New York, NY, USA
Stella K. Kang, Department of Radiology, Columbia University Irving Medical Center, New York, NY, USA
Yanqi Xu, Center for Data Science, New York University, New York, NY, USA
Beisong Liu, Center for Data Science, New York University, New York, NY, USA
Garvit Luhadia, Center for Data Science, New York University, New York, NY, USA
Wan Fung Chui, NYU Langone Health / NYU Grossman School of Medicine
Felicia L. Pasadyn, Department of Radiology, NYU Grossman School of Medicine, New York, NY, USA
Yu Chih Hung, Department of Radiology, NYU Grossman School of Medicine, New York, NY, USA
Julie Y. An, Department of Radiology, NYU Grossman School of Medicine; Department of Radiology, University of California San Diego, La Jolla, CA, USA
Edwin Mathieu, Department of Radiology, NYU Grossman School of Medicine, New York, NY, USA
Zehui Gu, Department of Radiology, Columbia University Irving Medical Center, New York, NY, USA
Carlos Fernandez-Granda, Courant Institute and Center for Data Science, New York University
Ammar A. Javed, Department of Surgery, NYU Grossman School of Medicine, New York, NY, USA
Greg D. Sacks, Department of Surgery, NYU Grossman School of Medicine, New York, NY, USA
Tamas Gonda, NYU Langone Health
Chenchan Huang, Department of Radiology, NYU Grossman School of Medicine, New York, NY, USA
Yiqiu Shen, New York University