🤖 AI Summary
This work addresses the difficulty clinicians face in querying complex oncology trial databases without SQL or schema expertise. To bridge this gap, the authors propose a feedback-driven clinical natural-language-to-SQL (NL2SQL) system that combines predicate-level question decomposition by a schema-aware large language model, sentence-embedding-based retrieval of expert-verified NL2SQL exemplars, and SQL synthesis conditioned on the decomposition, the retrieved exemplars, and the database schema. The system further incorporates approved clinician edits and a logic-based mechanism for automatically generating SQL variants, so the exemplar repository can grow without additional manual annotation. By supporting interactive querying and continuous refinement, the approach aims to improve the accuracy and practical utility of natural-language interfaces for oncology database interrogation.
📝 Abstract
Clinicians exploring oncology trial repositories often need ad-hoc, multi-constraint queries over biomarkers, endpoints, interventions, and time, yet writing SQL requires schema expertise. We demonstrate FD-NL2SQL, a feedback-driven clinical NL2SQL assistant for SQLite-based oncology databases. Given a natural-language question, FD-NL2SQL uses a schema-aware LLM to decompose it into predicate-level sub-questions, retrieves semantically similar expert-verified NL2SQL exemplars via sentence embeddings, and synthesizes executable SQL conditioned on the decomposition, retrieved exemplars, and schema, followed by post-processing validity checks. To improve with use, FD-NL2SQL incorporates two update signals: (i) clinician edits of generated SQL, which are added to the exemplar bank once approved; and (ii) lightweight logic-based SQL augmentation, which applies a single atomic mutation (e.g., an operator or column change) and retains a variant only if it returns a non-empty result. A second LLM generates the corresponding natural-language question and predicate decomposition for accepted variants, automatically expanding the exemplar bank without additional annotation. The demo interface exposes decomposition, retrieval, synthesis, and execution results to support interactive refinement and continuous improvement.
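
The logic-based augmentation step described in the abstract (apply one atomic mutation, keep the variant only if it returns rows) can be illustrated with a minimal sketch. This is not the authors' implementation: the `trials` table, its columns, the `OPERATOR_SWAPS` table, and the helper names `build_demo_db`, `mutate_operator`, and `accepted_variants` are all hypothetical, and only the operator-change mutation is shown. The sketch also assumes comparison operators are separated by whitespace in the seed query.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory SQLite database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trials (trial_id TEXT, phase INTEGER, median_os_months REAL)"
    )
    conn.executemany(
        "INSERT INTO trials VALUES (?, ?, ?)",
        [("T1", 2, 11.5), ("T2", 3, 18.0), ("T3", 3, 24.2)],
    )
    return conn


# Assumed mutation table: each comparison operator maps to one replacement.
OPERATOR_SWAPS = {">=": "<", "<=": ">", ">": "<=", "<": ">="}


def mutate_operator(sql):
    """Yield variants that differ from `sql` by exactly one operator swap."""
    tokens = sql.split()
    for i, tok in enumerate(tokens):
        if tok in OPERATOR_SWAPS:
            yield " ".join(tokens[:i] + [OPERATOR_SWAPS[tok]] + tokens[i + 1:])


def accepted_variants(conn, sql):
    """Execute each single-mutation variant and keep those with non-empty results."""
    kept = []
    for variant in mutate_operator(sql):
        try:
            rows = conn.execute(variant).fetchall()
        except sqlite3.Error:
            continue  # an invalid variant is simply discarded
        if rows:
            kept.append(variant)
    return kept


conn = build_demo_db()
seed = "SELECT trial_id FROM trials WHERE phase >= 3 AND median_os_months > 20"
variants = accepted_variants(conn, seed)
```

On this toy data the seed query yields two single-operator variants. The one that flips `phase >= 3` to `phase < 3` returns no rows and is dropped, while the one that flips `median_os_months > 20` to `median_os_months <= 20` returns a row and is retained. In the system described above, each retained variant would then be passed to a second LLM to produce its natural-language question and predicate decomposition; that step is omitted here.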