CancerLLM: A Large Language Model in Cancer Domain

📅 2024-06-15
🏛️ arXiv.org
📈 Citations: 2
✨ Influential: 0
🤖 AI Summary
To address the lack of lightweight, domain-specialized, and robust large language models (LLMs) for cancer phenotyping and diagnosis, this work introduces the first fine-grained, multi-cancer, lightweight medical LLM (7B parameters). Built upon the Mistral architecture, it undergoes domain-specific self-supervised pretraining and task-oriented fine-tuning using 2.7 million clinical notes and 515,000 pathology reports. The model achieves state-of-the-art performance in phenotypic entity extraction (F1 = 91.78%) and diagnostic statement generation (F1 = 86.81%), outperforming existing methods by an average of 9.23% while significantly reducing GPU memory footprint and inference latency. Its core contributions are threefold: (1) the first fine-grained, multi-cancerโ€“specific adaptation of a lightweight LLM; (2) dual-source, heterogeneous data-driven training leveraging both clinical and pathological textual corpora; and (3) a balanced design achieving high clinical accuracy, computational efficiency, and robustness in real-world medical settings.
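The F1 scores reported above for phenotypic entity extraction are presumably entity-level exact-match F1, the standard metric for clinical NER evaluation. A minimal sketch of that computation (the entity tuples below are illustrative, not from the paper's data):

```python
from collections import Counter

def entity_f1(predicted, gold):
    """Entity-level exact-match F1: each entity is a (span_text, type) pair.
    Multiset intersection handles repeated entities correctly."""
    pred_counts = Counter(predicted)
    gold_counts = Counter(gold)
    # True positives: entities present in both prediction and gold standard
    tp = sum((pred_counts & gold_counts).values())
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative example with made-up cancer phenotype entities
gold = [("adenocarcinoma", "HISTOLOGY"), ("stage III", "STAGE"), ("left lung", "SITE")]
pred = [("adenocarcinoma", "HISTOLOGY"), ("stage III", "STAGE")]
print(round(entity_f1(pred, gold), 2))  # → 0.8
```

Here precision is 1.0 (both predicted entities are correct) and recall is 2/3, giving F1 = 0.8; the paper's 91.78% figure would be this quantity aggregated over its test set.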

๐Ÿ“ Abstract
Medical Large Language Models (LLMs) have demonstrated impressive performance on a wide variety of medical NLP tasks; however, there is still no LLM specifically designed for phenotype identification and diagnosis in the cancer domain. Moreover, these LLMs typically have tens of billions of parameters, making them computationally expensive for healthcare systems. Thus, in this study, we propose CancerLLM, a model with 7 billion parameters and a Mistral-style architecture, pre-trained on nearly 2.7M clinical notes and over 515K pathology reports covering 17 cancer types, followed by fine-tuning on two cancer-relevant tasks: cancer phenotype extraction and cancer diagnosis generation. Our evaluation demonstrated that CancerLLM achieves state-of-the-art results, with an F1 score of 91.78% on phenotype extraction and 86.81% on diagnosis generation. It outperformed existing LLMs, with an average F1 score improvement of 9.23%. Additionally, CancerLLM demonstrated its efficiency in time and GPU usage, and its robustness compared with other LLMs. We demonstrated that CancerLLM can potentially provide an effective and robust solution to advance clinical research and practice in the cancer domain.
Problem

Research questions and friction points this paper is trying to address.

Develops a specialized LLM for cancer phenotyping and diagnosis
Reduces computational cost with a 7B-parameter efficient model
Improves accuracy in cancer tasks over existing medical LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

7B-parameter Mistral-style model for cancer
Pre-trained on 2.7M clinical cancer notes
Fine-tuned for phenotyping and diagnosis tasks
Mingchen Li
Division of Computational Health Sciences, University of Minnesota Twin Cities
Anne Blaes
Division of Hematology, Oncology and Transplantation, University of Minnesota Twin Cities
Steven Johnson
Professor of Applied Mathematics and Physics, Massachusetts Institute of Technology
Hongfang Liu
McWilliams School of Biomedical Informatics, UTHealth Houston
Hualei Xu
Department of Biomedical Informatics and Data Science, Yale School of Medicine
Rui Zhang
Division of Computational Health Sciences, University of Minnesota Twin Cities