🤖 AI Summary
Current methods for identifying follow-up requirements for incidentalomas in radiology reports operate only at the document level and cannot localize findings to individual lesions. Method: We propose a large language model (LLM) inference paradigm that integrates anatomy-aware prompting with lesion-specific markup inputs. We evaluate LLMs (Llama 3.1-8B, GPT-4o, and GPT-OSS-20b) against supervised baselines (e.g., BioClinicalModernBERT) at the lesion level. Crucially, we incorporate anatomical structural priors into prompt design to improve multi-site lesion localization and classification. Results: Anatomy-enhanced GPT-OSS-20b achieves a macro-F1 of 0.79, outperforming all supervised models; an integrated system further improves performance to 0.90, approaching inter-annotator agreement (Cohen’s κ = 0.92). This work delivers an interpretable, high-accuracy, fine-grained solution for clinical incidentaloma triage.
📝 Abstract
Objective: To evaluate large language models (LLMs) against supervised baselines for fine-grained, lesion-level detection of incidentalomas requiring follow-up, addressing the limitations of current document-level classification systems.
Methods: We utilized a dataset of 400 annotated radiology reports containing 1,623 verified lesion findings. We compared three supervised transformer-based encoders (BioClinicalModernBERT, ModernBERT, Clinical Longformer) against four generative LLM configurations built on Llama 3.1-8B, GPT-4o, and GPT-OSS-20b. We introduced a novel inference strategy using lesion-tagged inputs and anatomy-aware prompting to ground model reasoning. Performance was evaluated using class-specific F1-scores.
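The lesion-tagging and anatomy-aware prompting strategy can be sketched as follows. This is a minimal illustration of the idea, not the paper's implementation: the tag format, prompt wording, and function names are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of lesion-tagged input plus anatomy-aware prompting.
# Tag syntax, prompt text, and helper names are illustrative assumptions.

def tag_lesion(report: str, lesion: str, lesion_id: int) -> str:
    """Wrap one lesion mention in explicit markup so the model's answer
    is grounded to that specific finding rather than the whole report."""
    marked = f"<lesion id={lesion_id}>{lesion}</lesion>"
    return report.replace(lesion, marked, 1)

def build_prompt(report: str, lesion: str, anatomy: str) -> str:
    """Combine the tagged report with an anatomical prior for the lesion."""
    tagged = tag_lesion(report, lesion, 1)
    return (
        "You are reviewing a radiology report.\n"
        f"Anatomical context: the tagged lesion is located in the {anatomy}.\n"
        f"Report:\n{tagged}\n\n"
        "Question: Does <lesion id=1> represent an incidentaloma that "
        "requires follow-up? Answer 'yes' or 'no' with a brief rationale."
    )

prompt = build_prompt(
    "CT abdomen: 1.2 cm hypodense lesion in the left adrenal gland.",
    "1.2 cm hypodense lesion",
    "left adrenal gland",
)
```

The key design point is that classification happens once per tagged lesion, not once per report, which is what makes lesion-level evaluation possible.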
Results: The anatomy-informed GPT-OSS-20b model achieved the highest performance, yielding a macro-F1 of 0.79 on incidentaloma-positive findings. This surpassed all supervised baselines (maximum macro-F1: 0.70) and closely matched the inter-annotator agreement of 0.76. Explicit anatomical grounding yielded statistically significant performance gains across GPT-based models (p < 0.05), while a majority-vote ensemble of the top systems further improved the macro-F1 to 0.90. Error analysis revealed that anatomy-aware LLMs demonstrated superior contextual reasoning in distinguishing actionable findings from benign lesions.
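The two evaluation mechanics above, a per-lesion majority vote across systems and macro-averaged F1, are standard and can be sketched directly. This is a generic illustration under the assumption of one label per lesion per system; it is not the paper's evaluation code.

```python
# Generic sketch of per-lesion majority voting and macro-F1 scoring.
from collections import Counter

def majority_vote(predictions: list[list[str]]) -> list[str]:
    """predictions[i] is system i's label sequence; returns the
    most common label at each lesion position."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]

def macro_f1(gold: list[str], pred: list[str], classes: list[str]) -> float:
    """Unweighted mean of per-class F1, so rare classes count equally."""
    f1s = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy example: three systems, three lesions.
systems = [["pos", "neg", "pos"],
           ["pos", "pos", "pos"],
           ["neg", "neg", "pos"]]
ensemble = majority_vote(systems)  # ["pos", "neg", "pos"]
score = macro_f1(["pos", "neg", "pos"], ensemble, ["pos", "neg"])
```

Because macro-F1 weights each class equally, a model that ignores the rarer incidentaloma-positive class is penalized, which is why it is the headline metric here.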
Conclusion: Generative LLMs, when enhanced with structured lesion tagging and anatomical context, significantly outperform traditional supervised encoders and achieve performance comparable to human experts. This approach offers a reliable, interpretable pathway for automated incidental finding surveillance in radiology workflows.