Improving Drug Identification in Overdose Death Surveillance using Large Language Models

📅 2025-07-16
📈 Citations: 0
Influential: 0

🤖 AI Summary
Drug overdose deaths in the United States, driven largely by fentanyl, are rising, yet critical cause-of-death information remains buried in unstructured free-text death records, and ICD-10 coding introduces delays and information loss. To address this, we fine-tune a domain-adapted language model, BioClinicalBERT, for multi-label drug identification, extracting the substances involved directly from the free text. Compared with conventional machine learning methods, general-domain BERT models, and decoder-only large language models, this approach remains robust under external validation on 2023-2024 records, achieving a macro-F1 score of 0.966 (internal test ≥ 0.998). The result supports near real-time surveillance of emerging drug trends and gives public health interventions more timely data than manual ICD-10 coding allows.

📝 Abstract
The rising rate of drug-related deaths in the United States, largely driven by fentanyl, requires timely and accurate surveillance. However, critical overdose data are often buried in free-text coroner reports, leading to delays and information loss when coded into International Classification of Diseases, 10th Revision (ICD-10) classifications. Natural language processing (NLP) models may automate and enhance overdose surveillance, but prior applications have been limited. A dataset of 35,433 death records from multiple U.S. jurisdictions in 2020 was used for model training and internal testing. External validation was conducted using a novel separate dataset of 3,335 records from 2023-2024. Multiple NLP approaches were evaluated for classifying specific drug involvement from unstructured death certificate text. These included traditional single- and multi-label classifiers, as well as fine-tuned encoder-only language models such as Bidirectional Encoder Representations from Transformers (BERT) and BioClinicalBERT, and contemporary decoder-only large language models such as Qwen 3 and Llama 3. Model performance was assessed using macro-averaged F1 scores, and 95% confidence intervals were calculated to quantify uncertainty. Fine-tuned BioClinicalBERT models achieved near-perfect performance, with macro F1 scores >=0.998 on the internal test set. External validation confirmed robustness (macro F1=0.966), outperforming conventional machine learning, general-domain BERT models, and various decoder-only large language models. NLP models, particularly fine-tuned clinical variants like BioClinicalBERT, offer a highly accurate and scalable solution for overdose death classification from free-text reports. These methods can significantly accelerate surveillance workflows, overcoming the limitations of manual ICD-10 coding and supporting near real-time detection of emerging substance use trends.
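The abstract reports macro-averaged F1 scores with 95% confidence intervals. The sketch below illustrates one way such a metric could be computed for multi-label predictions. It is not the authors' evaluation code: the percentile bootstrap, the function names, the toy data, and the convention that a label with no true or predicted positives scores 0 are all assumptions made for illustration, since the abstract does not say how the intervals were derived.

```python
import random


def f1_for_label(y_true, y_pred, label_idx):
    """F1 for one label column of binary multi-label rows."""
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        t, p = truth[label_idx], pred[label_idx]
        if t and p:
            tp += 1
        elif p and not t:
            fp += 1
        elif t and not p:
            fn += 1
    denom = 2 * tp + fp + fn
    # Assumed convention: a label with no positives at all scores 0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    n_labels = len(y_true[0])
    return sum(f1_for_label(y_true, y_pred, j) for j in range(n_labels)) / n_labels


def bootstrap_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for macro F1 (an assumed CI method)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Resample records with replacement, keeping truth/prediction pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy data: 4 records, 2 hypothetical drug labels per record.
toy_true = [[1, 0], [1, 1], [0, 1], [0, 0]]
toy_pred = [[1, 0], [1, 0], [0, 1], [0, 0]]
point_estimate = macro_f1(toy_true, toy_pred)
ci_low, ci_high = bootstrap_ci(toy_true, toy_pred)
```

Macro averaging weights every drug label equally, so a rarely occurring substance influences the score as much as a common one. That property matters for surveillance, where emerging drugs are rare by definition.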
Problem

Research questions and friction points this paper is trying to address.

Automate drug identification in overdose death reports
Improve accuracy of overdose surveillance using NLP
Overcome limitations of manual ICD-10 coding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned BioClinicalBERT for overdose classification
External validation with 3,335 recent records
Outperforms traditional and general-domain models
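To make the multi-label framing concrete, the sketch below shows how per-drug output scores from a classifier could be turned into a set of identified substances. It assumes one logit per drug, an independent sigmoid per label, and a 0.5 decision threshold; the label list (beyond fentanyl, which the abstract names), the threshold, and the function names are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical label set; only fentanyl is named in the abstract.
DRUG_LABELS = ["fentanyl", "heroin", "cocaine", "methamphetamine"]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_multilabel(logits, labels=DRUG_LABELS, threshold=0.5):
    """Return every label whose sigmoid probability meets the threshold.

    Each label is decided independently, so a single record can be
    assigned zero, one, or several substances.
    """
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    return [label for label, z in zip(labels, logits) if sigmoid(z) >= threshold]


# Illustrative logits for one death record (made-up values).
example = decode_multilabel([2.0, -3.0, 0.5, -0.1])
```

Unlike a softmax over classes, independent sigmoids let one record carry several drugs at once, which matches the polysubstance cases that multi-label classification is meant to capture.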
Arthur J. Funnell
Medical & Imaging Informatics Group, Department of Radiological Sciences, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA
Panayiotis Petousis
Unknown affiliation
Fabrice Harel-Canada
University of California, Los Angeles
Machine Learning, Software Engineering
Ruby Romero
Computational and Systems Biology, University of California, Los Angeles, 102 Hershey Hall, Box 951600, Los Angeles, 90095, CA, USA
Alex A. T. Bui
Medical & Imaging Informatics Group, Department of Radiological Sciences, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA
Adam Koncsol
Division of General Internal Medicine and Health Services Research, University of California, Los Angeles, 1100 Glendon Ave STE 850, Los Angeles, 90024, CA, USA
Hritika Chaturvedi
Computational and Systems Biology, University of California, Los Angeles, 102 Hershey Hall, Box 951600, Los Angeles, 90095, CA, USA
Chelsea Shover
Division of General Internal Medicine and Health Services Research, University of California, Los Angeles, 1100 Glendon Ave STE 850, Los Angeles, 90024, CA, USA
David Goodman-Meza
Kirby Institute, University of New South Wales, Wallace Wurth Building (C27), Cnr High St & Botany St, UNSW, Sydney, 2052, NSW, Australia