Query, Don't Train: Privacy-Preserving Tabular Prediction from EHR Data via SQL Queries

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
In privacy-sensitive electronic health record (EHR) settings governed by regulations such as HIPAA and GDPR, conventional predictive modeling is hindered by strict restrictions on access to individual-level patient data. Method: This paper proposes a novel LLM-driven structured prediction paradigm that requires neither access to raw patient records nor supervised model training. Leveraging only database schema metadata, the LLM autonomously generates compliant SQL queries to extract aggregated statistics; subsequent chain-of-reasoning enables interpretable predictive modeling directly from these summaries. The approach ensures “zero raw-data access, zero model training, and end-to-end auditability,” while natively supporting high-dimensional numerical features and robust missing-value handling. Contribution/Results: We introduce the first framework for structured inference using LLMs that relies exclusively on schema information and aggregate statistics—bypassing individual-level data entirely. Evaluated on a MIMIC-style dataset for 30-day type-2 diabetes readmission prediction, it achieves an F1 score of 0.70, surpassing TabPFN (0.68), demonstrating strong privacy preservation, interpretability, and practical performance.

📝 Abstract
Electronic health records (EHRs) contain richly structured, longitudinal data essential for predictive modeling, yet stringent privacy regulations (e.g., HIPAA, GDPR) often restrict access to individual-level records. We introduce Query, Don't Train (QDT): a structured-data foundation-model interface enabling tabular inference via LLM-generated SQL over EHRs. Instead of training on or accessing individual-level examples, QDT uses a large language model (LLM) as a schema-aware query planner to generate privacy-compliant SQL queries from a natural language task description and a test-time input. The model then extracts summary-level population statistics through these SQL queries, and the LLM performs chain-of-thought reasoning over the results to make predictions. This inference-time-only approach (1) eliminates the need for supervised model training or direct data access, (2) ensures interpretability through symbolic, auditable queries, (3) naturally handles missing features without imputation or preprocessing, and (4) effectively manages high-dimensional numerical data to enhance analytical capabilities. We validate QDT on the task of 30-day hospital readmission prediction for Type 2 diabetes patients using a MIMIC-style EHR cohort, achieving F1 = 0.70, which outperforms TabPFN (F1 = 0.68). To our knowledge, this is the first demonstration of LLM-driven, privacy-preserving structured prediction using only schema metadata and aggregate statistics, offering a scalable, interpretable, and regulation-compliant alternative to conventional foundation-model pipelines.
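The QDT loop described in the abstract (schema-aware query planning → aggregate-only SQL → reasoning over the summary statistics) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the LLM planner is stubbed with a hand-written SQL query, the `admissions` table with its `age`, `hba1c`, and `readmitted_30d` columns is a toy schema invented for the example, and the 0.75 decision threshold is an arbitrary illustrative choice.

```python
import sqlite3

# Toy EHR-style database standing in for the hospital's records.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE admissions (
    patient_id INTEGER, age INTEGER, hba1c REAL, readmitted_30d INTEGER)""")
conn.executemany(
    "INSERT INTO admissions VALUES (?, ?, ?, ?)",
    [(1, 64, 8.1, 1), (2, 51, 6.9, 0), (3, 72, 9.4, 1), (4, 58, 7.2, 0)],
)

# Stand-in for the LLM-generated query: only aggregate statistics leave
# the database, never individual rows -- the privacy constraint QDT keeps.
generated_sql = """
    SELECT AVG(readmitted_30d) AS readmit_rate, COUNT(*) AS n
    FROM admissions
    WHERE age >= ? AND hba1c >= ?
"""

def predict_readmission(age, hba1c, threshold=0.75):
    """Stand-in for the chain-of-thought step: compare the readmission
    rate of a cohort similar to the test-time input against a threshold."""
    rate, n = conn.execute(generated_sql, (age - 10, hba1c - 1.0)).fetchone()
    return int(rate is not None and rate >= threshold)

print(predict_readmission(age=70, hba1c=9.0))  # → 1 (high-risk cohort)
```

In the paper, both the SQL and the decision logic are produced by the LLM at inference time; the symbolic query shown here is the artifact that makes the pipeline auditable.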
Problem

Research questions and friction points this paper is trying to address.

Privacy-preserving prediction from EHR data without individual access
Using LLM-generated SQL queries for interpretable tabular inference
Eliminating supervised training via schema-aware aggregate statistics
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM generates SQL queries for EHR data
Privacy-compliant via aggregate statistics only
No training needed, interpretable symbolic queries
Authors
Josefa Lia Stoisser
Novo Nordisk
Marc Boubnovski Martell
Novo Nordisk
Kaspar Martens
Novo Nordisk
Lawrence Phillips
Novo Nordisk
S. M. Town
Novo Nordisk
Rory M. Donovan-Maiye
Novo Nordisk
Julien Fauqueur
Novo Nordisk