🤖 AI Summary
Commercial large language models (LLMs) exhibit uncharacterized distributional distortion and feature correlation collapse when generating synthetic electronic health records (EHRs), severely limiting cross-hospital generalization in high-dimensional clinical data.
Method: We systematically evaluate GPT-4, Claude, and Gemini using a novel framework integrating structured prompt engineering with multi-center EHR pattern analysis, and quantify generation fidelity via statistical tests—including the Kolmogorov–Smirnov (KS) test.
Contribution/Results: We find that while LLMs preserve distributional fidelity on low-dimensional EHR subsets (KS *p* > 0.05), they significantly deviate from real-data distributions in full-dimensional EHRs (KS *p* < 0.01). Crucially, cross-institutional correlation modeling degrades sharply with increasing dimensionality. Our study identifies feature dimensionality expansion as a critical bottleneck for cross-hospital generalization of synthetic EHRs—providing empirical evidence and actionable insights for developing trustworthy generative AI in healthcare.
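The two-sample KS comparison described above can be sketched as follows. This is an illustrative example with randomly generated stand-in data, not the study's actual EHR features or pipeline: it shows how a well-matched synthetic feature passes the KS test while a distorted one fails, mirroring the low- vs. full-dimensional pattern reported in the results.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical stand-ins for one real EHR feature and two synthetic versions.
real = rng.normal(loc=100.0, scale=15.0, size=2000)            # e.g. a lab value
synthetic_good = rng.normal(loc=100.0, scale=15.0, size=2000)  # faithful generation
synthetic_bad = rng.normal(loc=110.0, scale=10.0, size=2000)   # distorted generation

# Two-sample KS test: a large p-value means no evidence the
# synthetic and real distributions differ; a small one flags distortion.
stat_good, p_good = ks_2samp(real, synthetic_good)
stat_bad, p_bad = ks_2samp(real, synthetic_bad)

print(f"faithful:  KS statistic = {stat_good:.3f}, p = {p_good:.3f}")
print(f"distorted: KS statistic = {stat_bad:.3f}, p = {p_bad:.3g}")
```

In the study's terms, the faithful case corresponds to the low-dimensional regime (KS *p* > 0.05, distributions indistinguishable) and the distorted case to the full-dimensional regime (KS *p* < 0.01). Per-feature KS tests like this only check marginal distributions, which is why the study separately examines feature correlations.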
📝 Abstract
Synthetic Electronic Health Records (EHRs) offer a valuable opportunity to create privacy-preserving and harmonized structured data, supporting numerous applications in healthcare. Key benefits of synthetic data include precise control over the data schema, improved fairness and representation of patient populations, and the ability to share datasets without risking the privacy of real individuals. Consequently, the AI community has increasingly turned to Large Language Models (LLMs) to generate synthetic data across various domains. However, a significant challenge in healthcare is ensuring that synthetic health records reliably generalize across different hospitals, a long-standing issue in the field. In this work, we evaluate the current state of commercial LLMs for generating synthetic data and investigate multiple aspects of the generation process to identify areas where these models excel and where they fall short. Our main finding is that while LLMs can reliably generate synthetic health records for smaller subsets of features, they struggle to preserve realistic distributions and correlations as the dimensionality of the data increases, ultimately limiting their ability to generalize across diverse hospital settings.