🤖 AI Summary
Problem: The ability of current large language models (LLMs) to interpret unstructured pathology reports, specifically for cancer typing, AJCC staging, and prognostic assessment, has not been systematically quantified.
Method: We conduct the first comprehensive zero-shot evaluation of mainstream LLMs on this task and introduce two pathology-domain instruction-tuned models: Path-llama3.1-8B and Path-GPT-4o-mini-FT. Our methodology integrates information extraction and high-level clinical reasoning, validated on a benchmark built from diverse, real-world pathology reports.
Contribution/Results: The proposed models achieve significant zero-shot performance gains over general-purpose baselines (+12.6% average F1) in cancer typing, staging, and prognosis prediction, with clinically interpretable outputs. Key contributions include: (1) the first zero-shot benchmark for pathology semantic parsing; (2) open-source, lightweight, domain-adapted instruction-tuned models; and (3) empirical validation of end-to-end LLM-based parsing of unstructured pathology text for clinical utility.
📝 Abstract
Large Language Models (LLMs) have shown significant promise across various natural language processing tasks. However, their application in pathology, particularly for extracting meaningful insights from unstructured medical texts such as pathology reports, remains underexplored and poorly quantified. In this project, we evaluate state-of-the-art language models, including the GPT family, Mistral models, and the open-source Llama models, on the comprehensive analysis of pathology reports. Specifically, we assess their performance in cancer type identification, AJCC stage determination, and prognosis assessment, tasks that span both information extraction and higher-order reasoning. Based on a detailed analysis of their zero-shot performance metrics, we develop two instruction-tuned models: Path-llama3.1-8B and Path-GPT-4o-mini-FT. These models outperform the other evaluated models in zero-shot cancer type identification, staging, and prognosis assessment.
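As a rough illustration of what a zero-shot evaluation of this kind might involve, the sketch below builds an instruction-only prompt for each of the three tasks, parses a model's reply, and scores predictions with macro-averaged F1. Everything in it is an assumption made for illustration: the prompt wording, the `TASKS` descriptions, the JSON answer format, and the helper names `build_zero_shot_prompt`, `parse_answer`, and `macro_f1` are hypothetical and are not taken from the paper. No model call is shown, since the abstract does not describe how the models were queried.

```python
import json
from collections import Counter

# Hypothetical task descriptions; the paper's actual prompts are not given here.
TASKS = {
    "cancer_type": "the cancer type",
    "ajcc_stage": "the AJCC stage (one of I, II, III, IV)",
    "prognosis": "the expected prognosis (favorable or unfavorable)",
}


def build_zero_shot_prompt(report_text: str, task: str) -> str:
    """Build a prompt that contains instructions only, with no worked examples."""
    return (
        "You are assisting with pathology report analysis.\n"
        f"Read the report and state {TASKS[task]}.\n"
        'Answer with JSON of the form {"answer": "..."}.\n\n'
        f"Report:\n{report_text}"
    )


def parse_answer(raw_response: str) -> str:
    """Extract the answer field; return 'unparsed' if the reply is not usable JSON."""
    try:
        return str(json.loads(raw_response)["answer"]).strip()
    except (json.JSONDecodeError, KeyError, TypeError):
        return "unparsed"


def macro_f1(gold: list[str], predicted: list[str]) -> float:
    """Unweighted mean of per-class F1 over the classes that occur in `gold`."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in set(gold):
        # F1 = 2TP / (2TP + FP + FN); the denominator is positive for gold labels.
        scores.append(2 * tp[label] / (2 * tp[label] + fp[label] + fn[label]))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(build_zero_shot_prompt("Invasive ductal carcinoma, pT2 pN1.", "ajcc_stage"))
    gold = ["II", "II", "III", "IV"]
    predicted = ["II", "III", "III", parse_answer("not json")]
    print(round(macro_f1(gold, predicted), 4))
```

Unparseable replies are mapped to a dedicated `unparsed` label so that they count as errors instead of being silently dropped; whether the authors handled malformed outputs this way is not stated.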