A text-to-tabular approach to generate synthetic patient data using LLMs

📅 2024-12-06
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
To address the scarcity of medical data caused by privacy constraints, data-sharing restrictions, and high acquisition costs, this paper proposes an approach for synthesizing tabular patient data that requires no access to the original data, only a natural-language description of the desired database. It relies on the prior medical knowledge and in-context learning capabilities of large language models (LLMs), steered by a structured prompt. The approach is evaluated quantitatively against state-of-the-art synthetic data generation models on fidelity, privacy, and utility metrics, and an ablation study identifies the prompt elements that contribute most to data quality. The results show that LLMs may not match state-of-the-art models trained on the original data, but they generate realistic patient data with well-preserved clinical correlations. The approach suits rapid generation of custom-designed patient data for project implementation and educational use.

📝 Abstract
Access to large-scale high-quality healthcare databases is key to accelerating medical research and making insightful discoveries about diseases. However, access to such data is often limited by patient privacy concerns, data sharing restrictions and high costs. To overcome these limitations, synthetic patient data has emerged as an alternative. However, synthetic data generation (SDG) methods typically rely on machine learning (ML) models trained on original data, leading back to the data scarcity problem. We propose an approach to generate synthetic tabular patient data that does not require access to the original data, but only a description of the desired database. We leverage prior medical knowledge and in-context learning capabilities of large language models (LLMs) to generate realistic patient data, even in a low-resource setting. We quantitatively evaluate our approach against state-of-the-art SDG models, using fidelity, privacy, and utility metrics. Our results show that while LLMs may not match the performance of state-of-the-art models trained on the original data, they effectively generate realistic patient data with well-preserved clinical correlations. An ablation study highlights key elements of our prompt contributing to high-quality synthetic patient data generation. This approach, which is easy to use and does not require original data or advanced ML skills, is particularly valuable for quickly generating custom-designed patient data, supporting project implementation and providing educational resources.
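The abstract names fidelity and privacy metrics without saying which ones are used. As a rough illustration only, the sketch below computes two generic proxies often seen in synthetic-data evaluation: the gap in pairwise correlation between real and synthetic columns (fidelity) and the distance from each synthetic row to its closest real row (privacy). The function names and toy values are assumptions made for this example, not the paper's metrics or data. Utility is left out because it needs a downstream model.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_gap(real, synth, i, j):
    """Fidelity proxy (hypothetical): |corr_real - corr_synth| for columns i and j."""
    r = pearson([row[i] for row in real], [row[j] for row in real])
    s = pearson([row[i] for row in synth], [row[j] for row in synth])
    return abs(r - s)

def distance_to_closest_record(real, synth):
    """Privacy proxy (hypothetical): Euclidean distance from each synthetic row
    to its nearest real row. A distance of 0 means a real record was copied."""
    return [min(math.dist(s, r) for r in real) for s in synth]

# Toy numeric rows, invented for this example.
real_rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
synth_rows = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5)]

gap = correlation_gap(real_rows, synth_rows, 0, 1)
dcr = distance_to_closest_record(real_rows, synth_rows)
```

A small correlation gap suggests the synthetic table keeps the relationship between the two columns, while a zero in the distance list flags a synthetic row identical to a real one.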
Problem

Research questions and friction points this paper is trying to address.

Generates synthetic patient data without original data access
Uses LLMs to create realistic data with clinical correlations
Addresses privacy and scarcity issues in healthcare databases
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based text-to-tabular synthetic data generation
No original data needed, only database description
Leverages medical knowledge and in-context learning
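The text-to-tabular idea can be sketched as follows: a plain-language database description and a column list are turned into a prompt, and the model's CSV reply is parsed back into rows. This is a minimal sketch under stated assumptions. The prompt wording, the `build_prompt` and `parse_rows` helpers, and the stubbed reply are all hypothetical and are not the authors' prompt or pipeline. No LLM is actually called.

```python
import csv
import io

def build_prompt(description, columns, n_rows, examples=()):
    """Assemble a prompt from a database description (hypothetical wording)."""
    lines = [
        "You are generating synthetic patient records.",
        f"Database description: {description}",
        f"Return exactly {n_rows} rows as CSV with this header:",
        ",".join(columns),
    ]
    if examples:
        # Optional in-context examples; these must be invented rows, not real patients.
        lines.append("Example rows (illustrative, not real patients):")
        lines.extend(",".join(str(v) for v in row) for row in examples)
    return "\n".join(lines)

def parse_rows(reply, columns):
    """Parse a CSV reply, dropping a repeated header and rows with the wrong width."""
    reader = csv.reader(io.StringIO(reply.strip()))
    rows = [r for r in reader if len(r) == len(columns)]
    if rows and rows[0] == list(columns):
        rows = rows[1:]
    return [dict(zip(columns, r)) for r in rows]

columns = ["age", "sex", "diagnosis"]
prompt = build_prompt("Adults with type 2 diabetes", columns, n_rows=2)

# Stand-in for a model reply; a real run would send `prompt` to an LLM of choice.
stub_reply = "age,sex,diagnosis\n54,F,T2D\n61,M,T2D\nbad,row\n"
records = parse_rows(stub_reply, columns)
```

Filtering malformed rows at parse time is a design choice for the sketch: LLM output is not guaranteed to follow the requested schema, so validating width before use avoids silently misaligned columns.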
Margaux Tornqvist
Quinten Health, 8 rue Vernier, Paris, France
Jean-Daniel Zucker
Senior Researcher, UMMISCO, IRD/Sorbonne University, France
Machine Learning, Data Science, Abstraction, Metagenomics, NLP
Tristan Fauvel
Quinten Health, 8 rue Vernier, Paris, France
Nicolas Lambert
Massachusetts Institute of Technology
Economics, AI, Computer Science
Mathilde Berthelot
Quinten Health, 8 rue Vernier, Paris, France
Antoine Movschin
Quinten Health, 8 rue Vernier, Paris, France