🤖 AI Summary
This work addresses natural-language-to-SQL generation in real-world clinical settings over electronic health records (EHRs) by introducing CLINSQL, a benchmark of 633 expert-annotated tasks that require models to comprehend complex clinical semantics, including the multi-table schema of MIMIC-IV v3.1, temporal windows, and patient-similarity cohorts. For the first time, it incorporates patient-similarity reasoning and clinical constraint mechanisms, emphasizing both the executability and the clinical reliability of generated SQL queries. The study systematically evaluates 22 large language models through a pipeline that integrates chain-of-thought self-refinement, rule-guided scoring, and execution validation. Results show that GPT-5-mini achieves the highest execution accuracy at 74.7%, and DeepSeek-R1 leads open-source models at 69.2%, yet overall performance remains insufficient for direct clinical deployment.
📝 Abstract
Real-world clinical text-to-SQL requires reasoning over heterogeneous EHR tables, temporal windows, and patient-similarity cohorts to produce executable queries. We introduce CLINSQL, a benchmark of 633 expert-annotated tasks on MIMIC-IV v3.1 that demands multi-table joins, clinically meaningful filters, and executable SQL. Solving CLINSQL entails navigating schema metadata and clinical coding systems, handling long contexts, and composing multi-step queries beyond traditional text-to-SQL. We evaluate 22 proprietary and open-source models under chain-of-thought self-refinement and use rubric-based SQL analysis with execution checks that prioritize critical clinical requirements. Despite recent advances, performance remains far from clinical reliability: on the test set, GPT-5-mini attains a 74.7% execution score, DeepSeek-R1 leads open-source models at 69.2%, and Gemini-2.5-Pro drops from 85.5% on Easy to 67.2% on Hard. Progress on CLINSQL marks tangible advances toward clinically reliable text-to-SQL for real-world EHR analytics.
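The execution checks described above can be illustrated with a minimal sketch of execution-based matching, a standard convention in text-to-SQL evaluation: run the predicted and gold queries against the database and compare their result sets, scoring non-executable predictions as failures. The toy table, queries, and the `execution_match` helper are invented for demonstration and are not CLINSQL's actual scoring code, which also applies rubric-based SQL analysis.

```python
import sqlite3

def execution_match(conn, pred_sql, gold_sql):
    """Return True if both queries execute and yield the same result set.

    Rows are compared as order-insensitive multisets, a common convention
    for execution accuracy in text-to-SQL benchmarks.
    """
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # a non-executable prediction scores zero
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))

# Tiny invented schema loosely echoing an EHR admissions table (not MIMIC-IV).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE admissions (subject_id INT, admittime TEXT);
    INSERT INTO admissions VALUES (1, '2180-05-06'), (2, '2181-01-02');
""")

gold = "SELECT subject_id FROM admissions WHERE admittime >= '2181-01-01'"
pred = "SELECT subject_id FROM admissions WHERE admittime > '2180-12-31'"
print(execution_match(conn, pred, gold))  # True: syntactically different, same rows
```

Note that two syntactically different queries can still match, which is why execution checks complement, rather than replace, rule-guided analysis of the SQL itself.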