Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

📅 2026-05-04

🤖 AI Summary
This work proposes ReClaim, the first generative foundation model for healthcare trained on nationwide-scale insurance claims data. It is designed to extract generalizable clinical and economic insights from 43.8 billion medical events recorded for more than 200 million MarketScan enrollees between 2008 and 2022. ReClaim is a Transformer trained from scratch at three sizes (140M, 700M, and 1.7B parameters) that models longitudinal trajectories of diagnoses, procedures, medications, and expenditures, enabling multitask disease prediction and real-world evidence generation. Across more than 1,000 disease-onset prediction tasks, it achieves a mean AUC of 75.6%, outperforming disease-specific LightGBM (66.3%) and Delphi (69.4%) by 9.3 and 6.2 percentage points, respectively. It also improves expenditure forecasting, raising explained variance from 0.28 to 0.37 relative to LightGBM, and reduces systematic bias in a target trial emulation by 72% on average relative to Delphi.
📝 Abstract
Evidence derived from large-scale real-world data (RWD) is increasingly informing regulatory evaluation and healthcare decision-making. Administrative claims provide population-scale, longitudinal records of healthcare utilization, expenditure, and detailed coding of diagnoses, procedures, and medications, yet their potential as a substrate for healthcare foundation models remains largely unexplored. Here we present ReClaim, a generative transformer trained from scratch on 43.8 billion medical events from more than 200 million enrollees in the MarketScan claims data spanning 2008-2022. ReClaim models longitudinal trajectories across diagnoses, procedures, medications, and expenditure, and was scaled to 140 million, 700 million, and 1.7 billion parameters. Across over 1,000 disease-onset prediction tasks, ReClaim achieved a mean AUC of 75.6%, substantially outperforming disease-specific LightGBM (66.3%) and the transformer-based Delphi model (69.4%), with the largest gains for rare diseases. These advantages held across retrospective and prospective evaluations and in external validation on two independent datasets. Performance improved monotonically with scale, and post-training added 13.8 percentage points over pre-training alone. Beyond disease prediction, ReClaim captured financial outcomes and improved real-world evidence (RWE) analyses: for healthcare expenditure forecasting it increased explained variance from 0.28 to 0.37 relative to LightGBM, and in a target trial emulation it reduced systematic bias by 72% on average relative to Delphi. Together, these results establish administrative claims as a scalable substrate for healthcare foundation models and show that learned representations generalize across time periods and data sources, supporting disease surveillance, expenditure forecasting, and RWE generation.
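The two headline metrics in the abstract, AUC for disease-onset prediction and explained variance for expenditure forecasting, can be illustrated with a minimal Python sketch. This is not the authors' evaluation code. The function names `auc` and `explained_variance` and all toy inputs are hypothetical, and the sketch assumes that "explained variance" means the coefficient of determination (R^2), which the abstract does not specify.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives
    a higher score than a randomly chosen negative case (ties count 0.5).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def explained_variance(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("actual values have zero variance")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Toy onset labels (1 = disease onset) and model risk scores; invented data.
    labels = [0, 0, 1, 1, 0, 1]
    scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]
    print("toy AUC:", auc(labels, scores))

    # Toy annual expenditures and forecasts; invented data.
    actual = [100.0, 200.0, 300.0, 400.0]
    predicted = [150.0, 200.0, 250.0, 400.0]
    print("toy explained variance:", explained_variance(actual, predicted))

    # Percentage-point gaps implied by the mean AUCs reported in the abstract.
    print("vs LightGBM:", round(75.6 - 66.3, 1))
    print("vs Delphi:", round(75.6 - 69.4, 1))
```

The pairwise AUC is quadratic in the number of cases, which is fine for illustration; a rank-based computation would be the usual choice at claims scale.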
Problem

Research questions and friction points this paper is trying to address.

real-world evidence
healthcare foundation models
administrative claims
disease prediction
medical expenditure forecasting
Innovation

Methods, ideas, or system contributions that make the work stand out.

foundation model
real-world evidence
medical claims
transformer
disease prediction
Fan Ma
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Yuntian Liu
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Xiang Lan
NC State University
AI4SE
Weipeng Zhou
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Jun Ni
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Mauro Giuffrè
Department of Medical, Surgical and Health Sciences, Università degli Studi di Trieste, Trieste, Italy
Lingfei Qian
Yale University
Xueqing Peng
Yale University
Yujia Zhou
Yale University
Ruey-Ling Weng
Yale University
Bioinformatics, User-Centered Design, Human-Computer Interaction (HCI)
Huan He
Yale University School of Medicine, Department of Biomedical Informatics & Data Science
data visualization, visual analytics
Lu Li
Ph.D. student, University of Pennsylvania
Qingyu Chen
Biomedical Informatics & Data Science, Yale University; NCBI-NLM, National Institutes of Health
Text mining, Machine learning, Data curation, BioNLP, Medical Imaging Analysis
Andrew Loza
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Laila Rasmy
UTHealth McWilliams School of Biomedical Informatics
Degui Zhi
Department Chair, Professor, University of Texas Health Science Center at Houston
EHR, Imaging genetics, Population Genetics Informatics
Yuan Lu
I-squared-R
Blockchains, Distributed Computing, Decentralization
Chenjie Zeng
Precision Health Informatics Section, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
Joshua C Denny
Precision Health Informatics Section, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA; All of Us Research Program, National Institutes of Health, Bethesda, MD, USA
Lee Schwamm
Department of Neurology, Yale School of Medicine, Yale University, New Haven, CT, USA
Daniella Meeker
Associate Professor, Biomedical Informatics and Data Science, Yale School of Medicine
Lucila Ohno-Machado
University of California San Diego
Biomedical Informatics, Predictive Modeling
Yong Chen
Professor of Biostatistics, The University of Pennsylvania
real-world data, clinical evidence generation, learning health system
Hua Xu
Robert T. McCluskey Professor, Section of Biomedical Informatics and Data Science, Yale University
natural language processing, text mining