ECG-LLM: training and evaluation of domain-specific large language models for electrocardiography

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of adapting large language models (LLMs) to electrocardiogram (ECG) interpretation. We propose a cardiovascular medicine–specific domain adaptation framework built upon open-source models such as Llama 3.1 70B, integrating supervised fine-tuning (SFT) with retrieval-augmented generation (RAG). A multi-tiered evaluation system is established, encompassing multiple-choice accuracy, automated text metrics (BLEU/ROUGE), and dual-track assessment via both LLM-as-a-judge and clinical expert annotation. Our key contribution is the empirical validation that lightweight domain fine-tuning—particularly suitable for privacy-sensitive, on-premise deployment—significantly outperforms base models in ECG diagnostic reasoning, achieving performance comparable to Claude 3.7 and surpassing most general-purpose LLMs. The work delivers a reproducible technical pathway and standardized evaluation paradigm for clinical-grade, regulatory-compliant, and edge-deployable AI-assisted ECG diagnosis.
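The summary describes a multi-tiered evaluation: multiple-choice accuracy plus automated text metrics such as BLEU/ROUGE. The snippet below is a minimal sketch of those two tiers, assuming a list of model choices against an answer key and a simplified unigram-overlap ROUGE-1; function names and the scoring details are illustrative, not taken from the paper, which would use standard metric libraries.

```python
# Hypothetical sketch of two evaluation tiers: multiple-choice accuracy
# and a simplified ROUGE-1-style unigram-overlap F1. Names are illustrative.

def multiple_choice_accuracy(predictions, gold):
    """Fraction of questions where the model's chosen option matches the key."""
    assert len(predictions) == len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 (a toy stand-in for a real ROUGE implementation)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = len(set(cand) & set(ref))
    if not cand or not ref or not overlap:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Example: score a model's answers and one free-text response
acc = multiple_choice_accuracy(["B", "C", "A", "D"], ["B", "C", "A", "A"])
score = rouge1_f1("sinus rhythm with first degree AV block",
                  "normal sinus rhythm and first degree AV block")
print(acc, round(score, 2))
```

In the study itself these automated scores are only one layer; LLM-as-a-judge and clinical expert annotation sit on top of them.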

📝 Abstract
Domain-adapted open-weight large language models (LLMs) offer promising healthcare applications, from queryable knowledge bases to multimodal assistants, with the crucial advantage of local deployment for privacy preservation. However, optimal adaptation strategies, evaluation methodologies, and performance relative to general-purpose LLMs remain poorly characterized. We investigated these questions in electrocardiography, an important area of cardiovascular medicine, by finetuning open-weight models on domain-specific literature and implementing a multi-layered evaluation framework comparing finetuned models, retrieval-augmented generation (RAG), and Claude Sonnet 3.7 as a representative general-purpose model. Finetuned Llama 3.1 70B achieved superior performance on multiple-choice evaluations and automatic text metrics, ranking second to Claude 3.7 in LLM-as-a-judge assessments. Human expert evaluation favored Claude 3.7 and RAG approaches for complex queries. Finetuned models significantly outperformed their base counterparts across nearly all evaluation modes. Our findings reveal substantial performance heterogeneity across evaluation methodologies, underscoring assessment complexity. Nevertheless, domain-specific adaptation through finetuning and RAG achieves competitive performance with proprietary models, supporting the viability of privacy-preserving, locally deployable clinical solutions.
Problem

Research questions and friction points this paper is trying to address.

Developing domain-specific LLMs for electrocardiography applications in healthcare
Evaluating optimal adaptation strategies for medical domain LLM performance
Comparing privacy-preserving local models against general-purpose proprietary LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned open-weight LLMs on ECG literature
Implemented multi-layered evaluation framework comparing methods
Used retrieval-augmented generation for enhanced clinical queries
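The RAG component listed above can be sketched as: retrieve the ECG-literature passages most relevant to a clinical query, then prepend them as grounding context for the LLM. Everything in this sketch is an assumption for illustration; the corpus, the word-overlap scorer (a stand-in for a real dense or BM25 retriever), and the prompt format are hypothetical, not the paper's pipeline.

```python
# Minimal RAG sketch: rank corpus passages by word overlap with the query
# (illustrative scoring only) and assemble a grounded prompt for the LLM.

def retrieve(query, corpus, k=2):
    """Return the k passages sharing the most words with the query."""
    q = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, passages):
    """Prepend retrieved context so the model answers from the literature."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "Atrial fibrillation shows an irregularly irregular rhythm without P waves.",
    "ST-segment elevation in contiguous leads suggests acute myocardial infarction.",
    "First degree AV block is defined by a PR interval longer than 200 ms.",
]
query = "What defines first degree AV block?"
print(build_prompt(query, retrieve(query, corpus)))
```

A local open-weight model would consume the resulting prompt, which is what makes this pattern compatible with the privacy-preserving, on-premise deployment the paper emphasizes.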