Fine-tuning foundational models to code diagnoses from veterinary health records

📅 2024-10-19

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

Problem: Inconsistent diagnostic coding and poor interoperability across institutions and species hinder effective use of veterinary electronic health records (EHRs). Method: We propose an automated SNOMED-CT diagnostic coding framework based on large language models (LLMs), fine-tuning ten freely available pre-trained Transformer models on free-text notes from 246,473 manually coded patient visits at the Colorado State University Veterinary Teaching Hospital. Contribution/Results: To our knowledge, this is the first approach to cover all 7,739 SNOMED-CT diagnosis codes used clinically at that institution. The best-performing model achieves an F1-score of 0.82, outperforming earlier efforts such as DeepTag and VetTag. Notably, even LLMs without clinical pre-training attain F1 > 0.78 under limited annotation budgets, which suggests the approach remains feasible in resource-constrained settings. This work offers a scalable, low-cost route toward interoperable, cross-institutional integration of veterinary health data.
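The summary reports F1-scores but does not say how they are averaged across codes. As a minimal sketch, assuming each visit carries a set of diagnosis codes and that scores are micro-averaged (both are assumptions, not details taken from the paper), the metric could be computed as follows. The function name `micro_f1` and the code strings such as `CODE_A` are hypothetical placeholders, not real SNOMED-CT identifiers.

```python
def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 over per-visit sets of diagnosis codes."""
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, pred_sets):
        tp += len(truth & pred)   # codes predicted and actually assigned
        fp += len(pred - truth)   # codes predicted but not assigned
        fn += len(truth - pred)   # codes assigned but missed
    if tp == 0:
        return 0.0                # avoids division by zero when nothing matches
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Placeholder labels for two visits (not real SNOMED-CT codes).
true_sets = [{"CODE_A", "CODE_B"}, {"CODE_C"}]
pred_sets = [{"CODE_A"}, {"CODE_C", "CODE_D"}]
score = micro_f1(true_sets, pred_sets)
```

Micro-averaging pools counts over all codes, so frequent diagnoses dominate the score; a macro average would instead weight each of the 7,739 codes equally and would likely be lower when many codes are rare.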

📝 Abstract

Veterinary medical records represent a large data resource for application to veterinary and One Health clinical research efforts. Use of the data is limited by interoperability challenges including inconsistent data formats and data siloing. Clinical coding using standardized medical terminologies enhances the quality of medical records and facilitates their interoperability with veterinary and human health records from other sites. Previous studies, such as DeepTag and VetTag, evaluated the application of Natural Language Processing (NLP) to automate veterinary diagnosis coding, employing long short-term memory (LSTM) and transformer models to infer a subset of Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT) diagnosis codes from free-text clinical notes. This study expands on these efforts by incorporating all 7,739 distinct SNOMED-CT diagnosis codes recognized by the Colorado State University (CSU) Veterinary Teaching Hospital (VTH) and by leveraging the increasing availability of pre-trained large language models (LLMs). Ten freely available pre-trained LLMs were fine-tuned on the free-text notes from 246,473 manually coded veterinary patient visits included in the CSU VTH's electronic health records (EHRs), which resulted in superior performance relative to previous efforts. The most accurate results were obtained when expansive labeled data were used to fine-tune relatively large clinical LLMs, but the study also showed that comparable results can be obtained using more limited resources and non-clinical LLMs. The results of this study contribute to the improvement of the quality of veterinary EHRs by investigating accessible methods for automated coding and support both animal and human health research by paving the way for more integrated and comprehensive health databases that span species and institutions.

Problem

Research questions and friction points this paper is trying to address.

Automating veterinary diagnosis coding from free-text clinical notes

Overcoming interoperability challenges in veterinary health records

Leveraging pre-trained language models for SNOMED-CT code assignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning ten freely available pre-trained language models for diagnosis coding

Training on free-text notes from 246,473 manually coded veterinary patient visits

Automating SNOMED-CT diagnosis code extraction from text
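One common way to frame this kind of coding task is multi-label classification: each note maps to a multi-hot target vector over the code vocabulary, and per-code model scores are thresholded back into a set of codes. The sketch below illustrates that framing only; it assumes a multi-label setup and a 0.5 threshold, neither of which is confirmed by the text above. The helper names (`build_vocab`, `multi_hot`, `decode`) and the code strings are hypothetical placeholders, not real SNOMED-CT identifiers.

```python
def build_vocab(label_sets):
    """Map every distinct code to a fixed column index (sorted for determinism)."""
    codes = sorted({code for labels in label_sets for code in labels})
    return {code: index for index, code in enumerate(codes)}


def multi_hot(labels, vocab):
    """Encode one visit's set of codes as a 0/1 target vector."""
    vector = [0.0] * len(vocab)
    for code in labels:
        vector[vocab[code]] = 1.0
    return vector


def decode(scores, vocab, threshold=0.5):
    """Turn per-code scores (e.g. sigmoid outputs) back into a set of codes."""
    index_to_code = {index: code for code, index in vocab.items()}
    return {index_to_code[i] for i, s in enumerate(scores) if s >= threshold}


# Placeholder labels for two visits (not real SNOMED-CT codes).
visits = [{"CODE_A", "CODE_C"}, {"CODE_B"}]
vocab = build_vocab(visits)
target = multi_hot(visits[0], vocab)
predicted = decode([0.9, 0.2, 0.6], vocab)
```

With the full vocabulary, each target vector would have 7,739 entries, almost all zero, which is one reason rare codes are hard to learn from limited labeled data.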

🔎 Similar Papers

Large Language Models for Disease Diagnosis: A Scoping Review