Improving Social Determinants of Health Documentation in French EHRs Using Large Language Models

📅 2025-07-04

🤖 AI Summary

Problem: Structured documentation of Social Determinants of Health (SDoH) in French electronic health records (EHRs) is severely incomplete: ICD-10 codes identify at least one SDoH for only 2.8% of patients. Method: This study introduces the first multi-class SDoH information extraction system for French clinical text built on a large language model (Flan-T5-Large). It automatically identifies 13 SDoH categories, including their temporal attributes and quantitative details, from unstructured clinical notes. It integrates manually annotated social history texts with fine-grained prompt engineering to jointly perform named entity recognition and relation extraction. Contribution/Results: The model achieves F1 scores above 0.80 for well-documented categories (living condition, marital status, descendants, job, tobacco, and alcohol use), with lower scores for categories that have limited training data or highly variable expressions, and it identifies at least one SDoH for 95.8% of patients. The authors publicly release two annotated French SDoH datasets, the first of their kind, addressing a critical gap in non-English clinical NLP resources and enabling scalable public health research and health equity analysis.

📝 Abstract

Social determinants of health (SDoH) significantly influence health outcomes, shaping disease progression, treatment adherence, and health disparities. However, their documentation in structured electronic health records (EHRs) is often incomplete or missing. This study presents an approach based on large language models (LLMs) for extracting 13 SDoH categories from French clinical notes. We trained Flan-T5-Large on annotated social history sections from clinical notes at Nantes University Hospital, France. We evaluated the model at two levels: (i) identification of SDoH categories and associated values, and (ii) extraction of detailed SDoH with associated temporal and quantitative information. Model performance was assessed across four datasets, including two that we publicly release as open resources. The model achieved strong performance for identifying well-documented categories such as living condition, marital status, descendants, job, tobacco, and alcohol use (F1 score > 0.80). Performance was lower for categories with limited training data or highly variable expressions, such as employment status, housing, physical activity, income, and education. Our model identified 95.8% of patients with at least one SDoH, compared to 2.8% for ICD-10 codes from structured EHR data. Our error analysis showed that performance limitations were linked to annotation inconsistencies, reliance on an English-centric tokenizer, and reduced generalizability due to the model being trained on social history sections only. These results demonstrate the effectiveness of NLP in improving the completeness of real-world SDoH data in a non-English EHR system.
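The first evaluation level described in the abstract (SDoH categories and their associated values) can be illustrated with a per-category F1 computation. This is a minimal sketch, not the authors' evaluation code: it assumes exact matching of (category, value) pairs per note, and the function name `per_category_f1`, the category labels, and the toy data are all hypothetical.

```python
from collections import defaultdict


def per_category_f1(gold, pred):
    """Compute F1 per SDoH category.

    gold, pred: lists with one entry per note; each entry is a set of
    (category, value) pairs. Exact pair matching is an assumption made
    for this sketch, not a criterion stated in the abstract.
    """
    tp = defaultdict(int)
    fp = defaultdict(int)
    fn = defaultdict(int)
    for g, p in zip(gold, pred):
        for cat, _ in g & p:   # predicted and annotated
            tp[cat] += 1
        for cat, _ in p - g:   # predicted but not annotated
            fp[cat] += 1
        for cat, _ in g - p:   # annotated but missed
            fn[cat] += 1

    scores = {}
    for cat in set(tp) | set(fp) | set(fn):
        predicted = tp[cat] + fp[cat]
        annotated = tp[cat] + fn[cat]
        precision = tp[cat] / predicted if predicted else 0.0
        recall = tp[cat] / annotated if annotated else 0.0
        denom = precision + recall
        scores[cat] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Toy data: two notes with hypothetical labels.
gold = [
    {("tobacco", "current"), ("marital_status", "married")},
    {("tobacco", "former")},
]
pred = [
    {("tobacco", "current"), ("marital_status", "single")},
    {("tobacco", "former")},
]
scores = per_category_f1(gold, pred)
```

In this toy example the tobacco predictions match the annotations in both notes, while the marital status value is wrong, so only the former category scores well. Reporting scores per category in this way makes it visible when sparsely documented categories lag behind well-documented ones.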

Problem

Research questions and friction points this paper is trying to address.

Extracting SDoH from French clinical notes using LLMs

Improving incomplete SDoH documentation in structured EHRs

Evaluating model performance on diverse SDoH categories
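The completeness comparison reported in the abstract (patients with at least one SDoH found by the model versus by ICD-10 codes) amounts to a patient-level coverage ratio. The sketch below shows one way such a ratio could be computed; the `coverage` function, the patient identifiers, and the toy findings are hypothetical and do not come from the paper.

```python
def coverage(patient_ids, findings):
    """Return the share of patients with at least one SDoH finding.

    findings maps a patient id to a collection of SDoH items; a patient
    who is missing from the mapping or has an empty collection counts
    as not covered.
    """
    if not patient_ids:
        return 0.0
    covered = sum(1 for pid in patient_ids if findings.get(pid))
    return covered / len(patient_ids)


# Toy cohort of four patients (illustrative values only).
patients = ["p1", "p2", "p3", "p4"]
nlp_findings = {
    "p1": ["tobacco"],
    "p2": ["job", "alcohol"],
    "p3": ["housing"],
    "p4": [],
}
icd10_findings = {"p2": ["Z59.0"]}  # structured codes are much sparser

nlp_cov = coverage(patients, nlp_findings)
icd_cov = coverage(patients, icd10_findings)
```

Computing both ratios over the same patient list keeps the denominators identical, which is what makes the two percentages directly comparable.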

Innovation

Methods, ideas, or system contributions that make the work stand out.

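Fine-tunes Flan-T5-Large to extract 13 SDoH categories with temporal and quantitative details

Trains on annotated social history sections of French clinical notes from Nantes University Hospital

Evaluates the model on four datasets, two of which are publicly released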
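Because Flan-T5 is a text-to-text model, extraction with it is naturally framed as generating a string that is then parsed back into structured records. The sketch below illustrates that idea only. The prompt wording, the `category=value; attribute=value` output format, and the helper names `build_prompt` and `parse_output` are assumptions made for illustration; the actual prompts and output format used in the paper are not given in this summary.

```python
def build_prompt(note, categories):
    """Build a text-to-text instruction for one clinical note (hypothetical wording)."""
    cats = ", ".join(categories)
    return f"Extract social determinants of health ({cats}) from the note: {note}"


def parse_output(text):
    """Parse a generated string into records.

    Assumed (hypothetical) format: records separated by '|', each written as
    'category=value' optionally followed by '; attribute=value' pairs.
    """
    records = []
    for chunk in text.split("|"):
        fields = [f.strip() for f in chunk.split(";") if f.strip()]
        if not fields:
            continue
        category, sep, value = fields[0].partition("=")
        if not sep:
            continue  # skip malformed records instead of guessing
        attributes = {}
        for field in fields[1:]:
            key, sep, val = field.partition("=")
            if sep:
                attributes[key.strip()] = val.strip()
        records.append({
            "category": category.strip(),
            "value": value.strip(),
            "attributes": attributes,
        })
    return records


prompt = build_prompt(
    "Patient marié, ancien fumeur, 20 PA.",
    ["marital_status", "tobacco"],
)
# Stand-in for a model generation; no model is called in this sketch.
generated = "marital_status=married | tobacco=former; quantity=20 PA"
records = parse_output(generated)
```

Keeping temporal and quantitative details as attributes of a category record mirrors the second evaluation level mentioned in the abstract, where detailed SDoH are scored together with their associated information.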
