🤖 AI Summary
This study addresses unmeasured confounding in causal inference from electronic health records, where critical confounders such as frailty and goals of care are often documented in clinical free text rather than in structured data. It uses large language models (LLMs) to extract these latent confounders from MIMIC-IV clinical notes and systematically compares seven strategies for integrating them into estimation of the effect of early vasopressor initiation on 28-day mortality among 21,859 sepsis patients. Directly adding LLM-derived covariates to the propensity score model performs best. In semi-synthetic experiments with known ground-truth effects, it reduces estimation bias from 0.0143 to 0.0003 relative to tabular-only methods; on real data, it lowers the estimated treatment effect from 0.055 to 0.027, directionally consistent with the CLOVERS randomized trial.
📝 Abstract
Causal inference from electronic health records (EHR) is fundamentally limited by unmeasured confounding: critical clinical states such as frailty, goals of care, and mental status are documented in free-text notes but absent from structured data. Large language models can extract these latent confounders as interpretable, structured covariates, yet how to effectively integrate them into causal estimation pipelines has not been systematically studied. Using the MIMIC-IV database with 21,859 sepsis patients, we compare seven covariate-integration strategies for estimating the effect of early vasopressor initiation on 28-day mortality, spanning tabular-only baselines, traditional NLP representations, and three LLM-augmented approaches. A central finding is that not all integration strategies are equally effective: directly augmenting the propensity score model with LLM covariates achieves the best performance, while dual-caliper matching on text-derived categorical distances restricts the donor pool and degrades estimation. In semi-synthetic experiments with known ground-truth effects, LLM-augmented propensity scores reduce estimation bias from 0.0143 to 0.0003 relative to tabular-only methods, and this advantage persists under substantial simulated extraction error. On real data, incorporating LLM-extracted covariates reduces the estimated treatment effect from 0.055 to 0.027, directionally consistent with the CLOVERS randomized trial, and a doubly robust estimator yielding 0.019 confirms the robustness of this finding. Our results offer practical guidance on when and how text-derived covariates improve causal estimation in critical care.
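To make the central comparison concrete, the following is a minimal sketch of augmenting a propensity score model with a text-derived covariate. It is not the authors' pipeline. Everything in it is an illustrative assumption: the data are simulated, `x` stands in for a tabular covariate, `u` for a hypothetical LLM-extracted flag such as frailty, `TRUE_EFFECT` and all coefficients are invented, the logistic model is fit by plain gradient ascent, and the effect is estimated with normalized inverse-probability weighting rather than the matching or doubly robust estimators mentioned above.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, steps=150, lr=2.0):
    """Fit logistic regression by batch gradient ascent (rows include an intercept)."""
    w = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for row, t in zip(rows, labels):
            err = t - sigmoid(sum(wj * xj for wj, xj in zip(w, row)))
            for j, xj in enumerate(row):
                grad[j] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

def ipw_ate(rows, t, y, w):
    """Normalized (Hajek) inverse-probability-weighted risk difference."""
    num1 = den1 = num0 = den0 = 0.0
    for row, ti, yi in zip(rows, t, y):
        e = sigmoid(sum(wj * xj for wj, xj in zip(w, row)))  # propensity score
        if ti == 1:
            num1 += yi / e
            den1 += 1.0 / e
        else:
            num0 += yi / (1.0 - e)
            den0 += 1.0 / (1.0 - e)
    return num1 / den1 - num0 / den0

# Simulated cohort; every number below is an assumption for illustration only.
TRUE_EFFECT = 0.03
rng = random.Random(0)
n = 4000
x = [int(rng.random() < 0.5) for _ in range(n)]  # structured (tabular) covariate
u = [int(rng.random() < 0.3) for _ in range(n)]  # confounder recorded only in notes
t = [int(rng.random() < sigmoid(-0.5 + 1.0 * xi + 1.5 * ui)) for xi, ui in zip(x, u)]
y = [int(rng.random() < 0.10 + TRUE_EFFECT * ti + 0.10 * xi + 0.30 * ui)
     for ti, xi, ui in zip(t, x, u)]

tab_rows = [[1.0, xi] for xi in x]                 # tabular-only propensity model
aug_rows = [[1.0, xi, ui] for xi, ui in zip(x, u)]  # augmented with the text covariate

bias_tab = ipw_ate(tab_rows, t, y, fit_logistic(tab_rows, t)) - TRUE_EFFECT
bias_aug = ipw_ate(aug_rows, t, y, fit_logistic(aug_rows, t)) - TRUE_EFFECT
print("tabular-only bias:", round(bias_tab, 4))
print("augmented bias:   ", round(bias_aug, 4))
```

Because the simulated confounder `u` raises both the chance of treatment and the risk of death, omitting it from the propensity model inflates the estimated harm of treatment, while including it moves the estimate back toward the simulated truth. The magnitudes printed here depend entirely on the invented coefficients and should not be compared with the figures reported in the abstract.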