AI Summary
This work proposes the first fully on-premises clinical contextual question answering (CCQA) framework that leverages open-source large language models, ranging from 4B to 70B parameters (including Llama-3.1-70B and Qwen3-30B-A3B-2507), to answer clinical queries directly from Finnish electronic health records (EHRs) without transmitting data externally. The system employs 4-bit and 8-bit quantization for efficient deployment and presents the first systematic evaluation of model accuracy, consistency, and calibration in a real-world offline clinical setting. Experimental results demonstrate that Llama-3.1-70B achieves 95.3% accuracy and 97.3% consistency in free-text generation, with quantized variants preserving performance while substantially reducing GPU memory usage. Clinical review revealed that only 2.9% of outputs contained clinically significant errors, underscoring the importance of robustness to semantically equivalent question phrasings for clinical safety.
Abstract
Clinicians often need to retrieve patient-specific information from electronic health records (EHRs), a task that is time-consuming and error-prone. We present a locally deployable Clinical Contextual Question Answering (CCQA) framework that answers clinical questions directly from EHRs without external data transfer. Open-source large language models (LLMs) ranging from 4B to 70B parameters were benchmarked under fully offline conditions using 1,664 expert-annotated question-answer pairs derived from records of 183 patients. The dataset consisted predominantly of Finnish clinical text. In free-text generation, Llama-3.1-70B achieved 95.3% accuracy and 97.3% consistency across semantically equivalent question variants, while the smaller Qwen3-30B-A3B-2507 model achieved comparable performance. In a multiple-choice setting, models showed similar accuracy but variable calibration. Low-precision quantization (4-bit and 8-bit) preserved predictive performance while reducing GPU memory requirements and improving deployment feasibility. Clinical evaluation identified clinically significant errors in 2.9% of outputs, and semantically equivalent questions occasionally yielded discordant responses, including instances where one formulation was correct and the other contained a clinically significant error (0.96% of cases). These findings demonstrate that locally hosted open-source LLMs can accurately retrieve patient-specific information from EHRs using natural-language queries, while highlighting the need for validation and human oversight in clinical deployment.
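To give a rough sense of why the 4-bit and 8-bit quantization discussed above matters for on-premises feasibility, the weight-memory footprint of a model scales linearly with bits per parameter. The following back-of-the-envelope estimate (a sketch only; it ignores activations, the KV cache, and quantization overhead) shows the effect for a 70B-parameter model such as Llama-3.1-70B:

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Approximate memory (GB) needed just to hold the model weights."""
    bytes_total = n_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# A 70B-parameter model at different precisions:
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weight_memory_gb(70, bits):.0f} GB")
# 16-bit: ~140 GB, 8-bit: ~70 GB, 4-bit: ~35 GB
```

At 4-bit precision the weights alone drop from roughly 140 GB to roughly 35 GB, which is the difference between requiring a multi-GPU server and fitting on one or two commodity accelerators.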