AI Summary
Structured documentation of Social Determinants of Health (SDoH) in French electronic health records (EHRs) is severely inadequate: ICD-10 codes identify at least one SDoH for only 2.8% of patients.
Method: This study introduces the first large language model-based (Flan-T5-Large) multi-class SDoH information extraction system for French clinical text. It automatically identifies 13 SDoH categories, including temporal attributes and quantitative details, from unstructured clinical notes. It combines manually annotated social history texts with fine-grained prompt engineering to jointly perform named entity recognition and relation extraction.
Contribution/Results: The model achieves F1 scores above 0.80 for well-documented SDoH categories (living condition, marital status, descendants, job, tobacco, and alcohol use) and identifies at least one SDoH for 95.8% of patients. We publicly release two high-quality, expert-annotated French SDoH datasets, the first of their kind, addressing the critical gap in non-English clinical NLP resources and enabling scalable public health research and health equity analysis.
Abstract
Social determinants of health (SDoH) significantly influence health outcomes, shaping disease progression, treatment adherence, and health disparities. However, their documentation in structured electronic health records (EHRs) is often incomplete or missing. This study presents an approach based on large language models (LLMs) for extracting 13 SDoH categories from French clinical notes. We trained Flan-T5-Large on annotated social history sections from clinical notes at Nantes University Hospital, France. We evaluated the model at two levels: (i) identification of SDoH categories and associated values, and (ii) extraction of detailed SDoH with associated temporal and quantitative information. Model performance was assessed across four datasets, including two that we publicly release as open resources. The model achieved strong performance for identifying well-documented categories such as living condition, marital status, descendants, job, tobacco, and alcohol use (F1 score > 0.80). Performance was lower for categories with limited training data or highly variable expressions, such as employment status, housing, physical activity, income, and education. Our model identified 95.8% of patients with at least one SDoH, compared to 2.8% for ICD-10 codes from structured EHR data. Our error analysis showed that performance limitations were linked to annotation inconsistencies, reliance on an English-centric tokenizer, and reduced generalizability due to the model being trained on social history sections only. These results demonstrate the effectiveness of NLP in improving the completeness of real-world SDoH data in a non-English EHR system.
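The two headline metrics in the abstract, per-category F1 and the share of patients with at least one identified SDoH, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the exact-match scoring rule, the `(category, value)` pair representation, and the function names `per_category_f1` and `patient_coverage` are assumptions made for illustration only.

```python
from collections import defaultdict


def per_category_f1(gold, pred):
    """Exact-match F1 per SDoH category (assumed scoring rule).

    gold, pred: parallel lists with one set of (category, value)
    pairs per clinical note.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        for cat, _ in g & p:   # pair predicted and present in gold
            tp[cat] += 1
        for cat, _ in p - g:   # pair predicted but absent from gold
            fp[cat] += 1
        for cat, _ in g - p:   # gold pair that was missed
            fn[cat] += 1
    scores = {}
    for cat in set(tp) | set(fp) | set(fn):
        # F1 = 2TP / (2TP + FP + FN); the denominator is nonzero here
        # because every category in the union has at least one count.
        scores[cat] = 2 * tp[cat] / (2 * tp[cat] + fp[cat] + fn[cat])
    return scores


def patient_coverage(sdoh_by_patient):
    """Fraction of patients with at least one identified SDoH."""
    if not sdoh_by_patient:
        return 0.0
    covered = sum(1 for items in sdoh_by_patient.values() if items)
    return covered / len(sdoh_by_patient)


# Toy data, invented for illustration (not from the paper).
gold = [{("tobacco", "current")},
        {("alcohol", "none"), ("tobacco", "former")}]
pred = [{("tobacco", "current")},
        {("alcohol", "none"), ("tobacco", "current")}]
scores = per_category_f1(gold, pred)
coverage = patient_coverage({"p1": {("tobacco", "current")}, "p2": set()})
```

Under this scoring rule a prediction with the right category but the wrong value counts as both a false positive and a false negative, which is one reason categories with highly variable expressions can score lower.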