AI Summary
This work addresses the challenges of terminology variability and noise in medical-domain Text-to-SQL tasks, where conventional retrieval-augmented approaches struggle to balance coverage and accuracy using static example pools. The authors propose CBR-to-SQL, a novel framework that introduces case-based reasoning (CBR) to this task by abstracting question-SQL pairs into reusable, structured case templates. A two-stage retrieval mechanism is designed: first matching logical structures and then resolving specific entities. This approach significantly enhances sample efficiency and robustness, achieving state-of-the-art logical form accuracy and competitive execution accuracy on the MIMICSQL dataset. Notably, it demonstrates superior performance under data-scarce conditions and in the presence of retrieval perturbations.
Abstract
Extracting insights from Electronic Health Record (EHR) databases often requires SQL expertise, creating a barrier for healthcare decision-making and research. While a promising approach is to use Large Language Models (LLMs) to translate natural language questions to SQL via Retrieval-Augmented Generation (RAG), adapting this approach to the medical domain is non-trivial. Standard RAG relies on single-step retrieval from a static pool of examples, which struggles with the variability and noise of medical terminology and jargon. This often leads to anti-patterns such as expanding the task demonstration pool to improve coverage, which in turn introduces noise and scalability problems. To address this, we introduce CBR-to-SQL, a framework inspired by Case-Based Reasoning (CBR). It represents question-SQL pairs as reusable, abstract case templates and utilizes a two-stage retrieval process that first captures logical structure and then resolves relevant entities. Evaluated on MIMICSQL, CBR-to-SQL achieves state-of-the-art logical form accuracy and competitive execution accuracy. More importantly, it demonstrates higher sample efficiency and robustness than standard RAG approaches, particularly under data scarcity and retrieval perturbations.
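The two-stage idea described above (match the logical structure first, then resolve entities) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the case base, the table and column names, the `<VALUE>` placeholder, the quoted-span convention for marking entities, the token-overlap similarity, and the `difflib` fuzzy matcher are all assumptions chosen for illustration. The framework described in the abstract uses an LLM, which this sketch omits entirely.

```python
import re
from difflib import get_close_matches

# Hypothetical case base: abstract question templates paired with SQL templates.
# Table and column names are illustrative only.
CASES = [
    {
        "question": "how many patients were diagnosed with <VALUE>",
        "sql": "SELECT COUNT(DISTINCT subject_id) FROM diagnoses "
               "WHERE short_title = '<VALUE>'",
    },
    {
        "question": "what is the age of patient <VALUE>",
        "sql": "SELECT age FROM demographic WHERE name = '<VALUE>'",
    },
]

# Hypothetical vocabulary of entity values stored in the database.
DIAGNOSES = ["hypertension", "sepsis", "pneumonia"]


def abstract(question):
    """Mask double-quoted spans so only the logical structure remains."""
    values = re.findall(r'"([^"]+)"', question)
    template = re.sub(r'"[^"]+"', "<VALUE>", question.lower())
    return template, values


def tokens(text):
    """Split into words, keeping placeholders such as <VALUE> as one token."""
    return set(re.findall(r"<\w+>|\w+", text))


def jaccard(a, b):
    """Token-overlap similarity between two templates."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve_case(template):
    """Stage 1: pick the stored case whose structure best matches."""
    return max(CASES, key=lambda case: jaccard(template, case["question"]))


def resolve_entity(mention, vocabulary):
    """Stage 2: map a noisy mention to the closest known value, if any."""
    matches = get_close_matches(mention.lower(), vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else mention


def answer(question):
    """Fill the retrieved SQL template with the resolved entities."""
    template, values = abstract(question)
    sql = retrieve_case(template)["sql"]
    for value in values:
        sql = sql.replace("<VALUE>", resolve_entity(value, DIAGNOSES), 1)
    return sql


print(answer('How many patients were diagnosed with "hypertenson"?'))
```

Because the misspelled term is masked before the first stage, it cannot distort the structural match; it is only corrected against the vocabulary in the second stage. That separation is one plausible reading of why abstract case templates could tolerate noisy medical terminology better than retrieval over raw examples.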