🤖 AI Summary
This study addresses the challenge of extracting key clinical information (TNM staging, histological grading, and biomarker status) from unstructured breast cancer pathology reports, whose free-text format hinders large-scale data utilization. The authors propose a parameter-efficient multi-task framework that fine-tunes the Llama-3-8B-Instruct encoder using Low-Rank Adaptation (LoRA) and adds parallel classification heads that constrain outputs to a predefined medical schema. Trained on 10,677 expert-annotated pathology reports, the model achieves a Macro F1 score of 0.976 across all three tasks, outperforming rule-based NLP systems, zero-shot large language models, and single-task baselines. The approach combines high parsing consistency with low computational overhead, offering a scalable solution for structured information extraction from clinical narratives.
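For readers unfamiliar with the headline metric, the following is a minimal sketch of how a Macro F1 score is commonly computed for a single classification task: the F1 of each class is computed separately and then averaged without weighting by class frequency. The function name `macro_f1` and the grade labels in the usage note are illustrative assumptions. How the authors aggregate the score across the three tasks is not stated here.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class was wrong
            fn[truth] += 1   # true class was missed

    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

As a usage example, `macro_f1(["G1", "G2", "G2", "G3"], ["G1", "G2", "G3", "G3"])` gives per-class F1 values of 1, 2/3, and 2/3, so the macro average is 7/9. Because every class counts equally, rare categories affect the score as much as common ones.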
📝 Abstract
Pathology reports serve as the definitive record for breast cancer staging, yet their unstructured format impedes large-scale data curation. While Large Language Models (LLMs) offer semantic reasoning, their deployment is often limited by high computational costs and hallucination risks. This study introduces a parameter-efficient, multi-task framework for automating the extraction of Tumor-Node-Metastasis (TNM) staging, histologic grade, and biomarkers. We fine-tune a Llama-3-8B-Instruct encoder using Low-Rank Adaptation (LoRA) on a curated, expert-verified dataset of 10,677 reports. Unlike generative approaches, our architecture uses parallel classification heads to enforce consistent schema adherence. Experimental results demonstrate that the model achieves a Macro F1 score of 0.976, successfully resolving complex contextual ambiguities and heterogeneous reporting formats that challenge traditional extraction methods, including rule-based natural language processing (NLP) pipelines, zero-shot LLMs, and single-task LLM baselines. The proposed parameter-efficient, multi-task architecture enables reliable, scalable pathology-derived cancer staging and biomarker profiling, with the potential to enhance clinical decision support and accelerate data-driven oncology research.
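The idea of parallel classification heads over a shared representation can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The label sets in `SCHEMA`, the class `LinearHead`, and the helpers `build_heads` and `extract` are all hypothetical, and a plain list of floats stands in for the pooled output of the LoRA-adapted encoder. The point it shows is that each head can only emit a label from its own fixed set, so the output always conforms to the schema, unlike free-text generation.

```python
import random

# Hypothetical label schema; the paper's actual label sets are not given here.
SCHEMA = {
    "t_stage": ["T0", "T1", "T2", "T3", "T4", "TX"],
    "grade": ["G1", "G2", "G3", "GX"],
    "er_status": ["positive", "negative", "unknown"],
}


class LinearHead:
    """One classification head: a linear layer followed by an argmax over fixed labels."""

    def __init__(self, in_dim, labels, rng):
        self.labels = labels
        # Small random weights stand in for trained parameters.
        self.weights = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in labels]
        self.bias = [0.0] * len(labels)

    def logits(self, features):
        return [
            sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(self.weights, self.bias)
        ]

    def predict(self, features):
        scores = self.logits(features)
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.labels[best]  # always a member of this head's label set


def build_heads(in_dim, schema=SCHEMA, seed=0):
    """Create one independent head per task, all reading the same features."""
    rng = random.Random(seed)
    return {task: LinearHead(in_dim, labels, rng) for task, labels in schema.items()}


def extract(features, heads):
    """Run every head on the shared features and return a schema-shaped record."""
    return {task: head.predict(features) for task, head in heads.items()}
```

Calling `extract([0.1] * 8, build_heads(8))` returns a dictionary with exactly one key per task, each mapped to a label from that task's list. In a trained system the heads would typically be optimized jointly, for instance with a sum of per-task cross-entropy losses, though the training objective used by the authors is not stated in the abstract.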