🤖 AI Summary
This study addresses the highly specialized task of automatically generating diagnostic impressions from the findings in PET/CT imaging reports, a task on which existing large language models perform poorly in zero-shot settings. The authors introduce PET-F2I-41K, the first benchmark dataset for this task, built from over 41,000 real-world clinical reports, and propose three clinically oriented evaluation metrics: Entity Coverage Rate (ECR), Uncovered Entity Rate (UER), and Factual Consistency Rate (FCR). Building on Qwen2.5-7B-Instruct, they develop PET-F2I-7B through parameter-efficient fine-tuning with LoRA. The resulting model achieves a BLEU-4 score of 0.708 and a threefold improvement in entity coverage over the strongest baseline. It also shows advantages in generation completeness and factual consistency, and in practical aspects of clinical deployment: lower computational cost, lower inference latency, and stronger privacy preservation.
📝 Abstract
PET/CT imaging is pivotal in oncology and nuclear medicine, yet summarizing complex findings into precise diagnostic impressions is labor-intensive. While LLMs have shown promise in medical text generation, their capability in the highly specialized domain of PET/CT remains underexplored. We introduce PET-F2I-41K (PET Findings-to-Impression Benchmark), a large-scale benchmark for PET/CT impression generation using LLMs, constructed from over 41k real-world reports. Using PET-F2I-41K, we conduct a comprehensive evaluation of 27 models across proprietary frontier LLMs, open-source generalist models, and medical-domain LLMs, and we develop a domain-adapted 7B model (PET-F2I-7B) fine-tuned from Qwen2.5-7B-Instruct via LoRA. Beyond standard NLG metrics (e.g., BLEU-4, ROUGE-L, BERTScore), we propose three clinically grounded metrics - Entity Coverage Rate (ECR), Uncovered Entity Rate (UER), and Factual Consistency Rate (FCR) - to assess diagnostic completeness and factual reliability. Experiments reveal that neither frontier nor medical-domain LLMs perform adequately in zero-shot settings. In contrast, PET-F2I-7B achieves substantial gains (e.g., 0.708 BLEU-4) and a 3.0x improvement in entity coverage over the strongest baseline, while offering advantages in cost, latency, and privacy. Beyond this modeling contribution, PET-F2I-41K establishes a standardized evaluation framework to accelerate the development of reliable and clinically deployable reporting systems for PET/CT.
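The abstract names the entity-level metrics but does not define them, so the following is only an illustrative sketch of how a coverage-style metric could be computed, not the authors' formulation. It assumes that clinical entities have already been extracted from the reference and generated impressions (the extraction step is not shown), that ECR is the fraction of reference entities that also appear in the generated impression, and that UER is its complement. The function names and the case-insensitive exact-match rule are hypothetical.

```python
def entity_coverage_rate(reference_entities, generated_entities):
    """Fraction of reference entities found in the generated impression.

    Assumption: entities are pre-extracted strings, matched exactly after
    lowercasing and trimming whitespace. The paper may match differently.
    """
    ref = {e.strip().lower() for e in reference_entities}
    gen = {e.strip().lower() for e in generated_entities}
    if not ref:
        # Nothing to cover, so treat coverage as complete by convention.
        return 1.0
    return len(ref & gen) / len(ref)


def uncovered_entity_rate(reference_entities, generated_entities):
    """Assumed here to be the complement of the coverage rate."""
    return 1.0 - entity_coverage_rate(reference_entities, generated_entities)


# Hypothetical entities, for illustration only.
reference = ["lung nodule", "FDG uptake", "lymph node", "liver lesion"]
generated = ["Lung nodule", "fdg uptake"]

ecr = entity_coverage_rate(reference, generated)
uer = uncovered_entity_rate(reference, generated)
print(f"ECR={ecr:.2f} UER={uer:.2f}")
```

Under these assumptions, a generated impression that mentions two of the four reference entities scores an ECR of 0.50. A real evaluation would also need to handle synonyms, abbreviations, and negation, which exact string matching ignores.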