Multi-Task LLM with LoRA Fine-Tuning for Automated Cancer Staging and Biomarker Extraction

📅 2026-04-14

🤖 AI Summary

This study addresses the difficulty of extracting key clinical information (TNM staging, histologic grade, and biomarker status) from breast cancer pathology reports, whose unstructured format hinders large-scale data use. The authors propose a parameter-efficient multi-task framework that fine-tunes a Llama-3-8B-Instruct encoder with Low-Rank Adaptation (LoRA) and adds parallel classification heads to keep outputs consistent with a predefined schema. Fine-tuned on a curated, expert-verified dataset of 10,677 reports, the model achieves a Macro F1 score of 0.976 and is reported to outperform rule-based NLP pipelines, zero-shot LLMs, and single-task LLM baselines. The approach pairs consistent schema adherence with parameter-efficient training, offering a scalable route to structured information extraction from clinical narratives.
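For readers unfamiliar with LoRA, the standard formulation is sketched below. The first line is the usual low-rank reparameterization of a frozen weight matrix. The second is only an assumed form of the multi-task objective: the summary does not state the rank r, the scaling factor α, which layers are adapted, how many heads there are, or how the per-head losses are weighted, so the set of heads 𝒯 and the unweighted sum are illustrative.

```latex
% Standard LoRA update: W_0 is frozen, only A and B are trained.
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)

% Assumed objective: one cross-entropy term per classification head t
% (the set \mathcal{T} and the equal weighting are assumptions, not from the paper).
\mathcal{L} = \sum_{t \in \mathcal{T}} \mathcal{L}_{\mathrm{CE}}^{(t)}
```

Because each head predicts over a fixed label set, every prediction is a valid schema value by construction, which is the contrast with free-text generation that the summary draws.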

📝 Abstract

Pathology reports serve as the definitive record for breast cancer staging, yet their unstructured format impedes large-scale data curation. While Large Language Models (LLMs) offer semantic reasoning, their deployment is often limited by high computational costs and hallucination risks. This study introduces a parameter-efficient, multi-task framework for automating the extraction of Tumor-Node-Metastasis (TNM) staging, histologic grade, and biomarkers. We fine-tune a Llama-3-8B-Instruct encoder using Low-Rank Adaptation (LoRA) on a curated, expert-verified dataset of 10,677 reports. Unlike generative approaches, our architecture utilizes parallel classification heads to enforce consistent schema adherence. Experimental results demonstrate that the model achieves a Macro F1 score of 0.976, successfully resolving complex contextual ambiguities and heterogeneous reporting formats that challenge traditional extraction methods including rule-based natural language processing (NLP) pipelines, zero-shot LLMs, and single-task LLM baselines. The proposed adapter-efficient, multi-task architecture enables reliable, scalable pathology-derived cancer staging and biomarker profiling, with the potential to enhance clinical decision support and accelerate data-driven oncology research.
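The headline metric, Macro F1, is the unweighted mean of per-class F1 scores, so rare classes count as much as common ones. The snippet below is a minimal sketch of that computation for a single classification task. The function name `macro_f1` and the toy grade labels are illustrative assumptions, not data or code from the study, and the abstract does not say how scores are aggregated across the three tasks.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); guard against an empty class.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, made-up histologic grade labels (hypothetical, not from the paper).
y_true = ["G1", "G2", "G2", "G3", "G3", "G3"]
y_pred = ["G1", "G2", "G3", "G3", "G3", "G3"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))  # 0.8413
```

In this toy case the single G2 error lowers the G2 and G3 class scores to 2/3 and 6/7, and the macro average gives each class equal weight regardless of how many reports fall in it.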

Problem

Research questions and friction points this paper is trying to address.

cancer staging

biomarker extraction

pathology reports

unstructured data

TNM classification

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-task LLM

LoRA fine-tuning

Cancer staging