Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models

📅 2026-03-27
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work proposes the first fully on-premises clinical contextual question answering (CCQA) framework that leverages open-source large language models (ranging from 4B to 70B parameters, including Llama-3.1-70B and Qwen3-30B-A3B-2507) to directly answer clinical queries from Finnish electronic health records (EHRs) without transmitting data externally. The system employs 4/8-bit quantization for efficient deployment and presents the first systematic evaluation of model accuracy, consistency, and calibration in a real-world offline clinical setting. Experimental results demonstrate that Llama-3.1-70B achieves 95.3% accuracy and 97.3% consistency in free-text generation, with quantized variants preserving performance while substantially reducing GPU memory usage. Clinical review revealed that only 2.9% of outputs contained clinically significant errors, underscoring the critical role of semantic equivalence in ensuring safety.
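The memory savings from 4/8-bit quantization mentioned above can be illustrated with a back-of-envelope calculation. This is a rough sketch for weight storage only (it ignores activations, the KV cache, and quantization overhead); the function and figures are illustrative, not taken from the paper.

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Approximate memory needed to hold model weights alone,
    in gigabytes (10^9 bytes), at the given precision."""
    return n_params * bits / 8 / 1e9

# A 70B-parameter model at different precisions:
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
# 16-bit: 140 GB
# 8-bit: 70 GB
# 4-bit: 35 GB
```

This is why 4-bit variants of a 70B model become feasible on a small number of local GPUs, consistent with the deployment focus of the paper.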
📝 Abstract
Clinicians often need to retrieve patient-specific information from electronic health records (EHRs), a task that is time-consuming and error-prone. We present a locally deployable Clinical Contextual Question Answering (CCQA) framework that answers clinical questions directly from EHRs without external data transfer. Open-source large language models (LLMs) ranging from 4B to 70B parameters were benchmarked under fully offline conditions using 1,664 expert-annotated question-answer pairs derived from records of 183 patients. The dataset consisted predominantly of Finnish clinical text. In free-text generation, Llama-3.1-70B achieved 95.3% accuracy and 97.3% consistency across semantically equivalent question variants, while the smaller Qwen3-30B-A3B-2507 model achieved comparable performance. In a multiple-choice setting, models showed similar accuracy but variable calibration. Low-precision quantization (4-bit and 8-bit) preserved predictive performance while reducing GPU memory requirements and improving deployment feasibility. Clinical evaluation identified clinically significant errors in 2.9% of outputs, and semantically equivalent questions occasionally yielded discordant responses, including instances where one formulation was correct and the other contained a clinically significant error (0.96% of cases). These findings demonstrate that locally hosted open-source LLMs can accurately retrieve patient-specific information from EHRs using natural-language queries, while highlighting the need for validation and human oversight in clinical deployment.
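The abstract reports consistency across semantically equivalent question variants. One simple way such a metric could be scored is the fraction of question groups whose paraphrase variants all receive the same answer; the function below is a minimal sketch of that idea, with a hypothetical definition and example data (neither is taken from the paper, which also considers semantic rather than exact equivalence of answers).

```python
def consistency(answers_by_question: dict[str, list[str]]) -> float:
    """Fraction of question groups whose paraphrase variants all
    received the same answer. Each value is a list of model answers,
    one per semantically equivalent phrasing of that question."""
    groups = list(answers_by_question.values())
    agree = sum(1 for variants in groups if len(set(variants)) == 1)
    return agree / len(groups)

# Hypothetical example: two questions, each asked in two phrasings.
answers = {
    "current_medication": ["metformin", "metformin"],    # consistent
    "allergy_status":     ["penicillin", "no allergy"],  # discordant
}
print(consistency(answers))  # 0.5
```

A production scorer would compare answers by semantic equivalence (e.g. via clinical review, as in the paper) rather than string equality.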
Problem

Research questions and friction points this paper is trying to address.

Clinical Information Retrieval
Electronic Health Records
Large Language Models
Natural Language Processing
Finnish Clinical Text
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clinical Question Answering
Large Language Models
Electronic Health Records
Local Deployment
Model Quantization
🔎 Similar Papers
No similar papers found.
Mikko Saukkoriipi
Department of Computer Science, Aalto University School of Science, Espoo, 02150, Finland.
Nicole Hernandez
Faculty of Medicine and Health Technology, Tampere University, Tampere, 33520, Finland.
Jaakko Sahlsten
Department of Computer Science, Aalto University School of Science, Espoo, 02150, Finland.
Kimmo Kaski
Professor of Computational Science, Aalto University School of Science
Computational Science, Statistical Physics, Complex Systems & Networks, Computational Social Science, Data Science & AI
Otso Arponen
Faculty of Medicine and Health Technology, Tampere University, Tampere, 33520, Finland; Department of Oncology, TAYS Cancer Centre, Tampere University Hospital, Tampere, 33520, Finland.