🤖 AI Summary
Biomedical Text-to-SQL systems often generate erroneous SQL for ambiguous, out-of-scope, or unanswerable queries, undermining reliability. Method: We propose an explicit refusal mechanism to enhance trustworthiness, introducing (i) novel No-Answer Rules (NAR) and a balanced few-shot prompting strategy; (ii) OncoMX-NAQ—the first biomedical unanswerable-question benchmark (80 instances across 8 categories); and (iii) a unified framework integrating schema-aware prompting, rule-guided learning, and structured refusal classification. Contributions/Results: Our approach enables interpretable refusal decisions and includes a lightweight, interactive visualization interface. On OncoMX-NAQ, it achieves 0.80 overall refusal accuracy, with near-perfect accuracy on critical error classes, including non-SQL queries, missing-column cases, and out-of-domain questions. Crucially, it presents generated SQL, execution results, and human-readable refusal rationales side by side.
📝 Abstract
Text-to-SQL systems allow users without SQL expertise to interact with relational databases using natural language. However, their tendency to generate executable SQL for ambiguous, out-of-scope, or unanswerable queries introduces a hidden risk, as outputs may be misinterpreted as correct. This risk is especially serious in biomedical contexts, where precision is critical. We therefore present Query Carefully, a pipeline that integrates LLM-based SQL generation with explicit detection and handling of unanswerable inputs. Building on the OncoMX component of ScienceBenchmark, we construct OncoMX-NAQ (No-Answer Questions), a set of 80 no-answer questions spanning 8 categories (non-SQL, out-of-schema/domain, and multiple ambiguity types). Our approach employs llama3.3:70b with schema-aware prompts, explicit No-Answer Rules (NAR), and few-shot examples drawn from both answerable and unanswerable questions. We evaluate SQL exact match, result accuracy, and unanswerable-detection accuracy. On the OncoMX dev split, few-shot prompting with answerable examples increases result accuracy, and adding unanswerable examples does not degrade performance. On OncoMX-NAQ, balanced prompting achieves the highest unanswerable-detection accuracy (0.8), with near-perfect results for structurally defined categories (non-SQL, missing columns, out-of-domain) but persistent challenges for missing-value queries (0.5) and column ambiguity (0.3). A lightweight user interface surfaces interim SQL, execution results, and abstentions, supporting transparent and reliable text-to-SQL in biomedical applications.
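The pipeline described above can be sketched in a few lines: assemble a schema-aware prompt that embeds No-Answer Rules and a balanced mix of answerable/unanswerable few-shot examples, then parse the model's reply into either SQL or a structured refusal with a category and rationale. This is a minimal illustrative sketch; the rule wording, category names, `NO_ANSWER | …` reply format, and function names are assumptions for exposition, not the paper's exact implementation.

```python
# Illustrative sketch of a refusal-aware text-to-SQL pipeline.
# Category names and the reply format are hypothetical placeholders.

# Hypothetical refusal categories, loosely mirroring the paper's 8 classes.
REFUSAL_CATEGORIES = [
    "non_sql", "out_of_domain", "out_of_schema", "missing_column",
    "missing_value", "column_ambiguity", "value_ambiguity", "underspecified",
]

# Assumed No-Answer Rules (NAR): instruct the model to abstain with a
# structured, machine-parseable refusal instead of emitting SQL.
NO_ANSWER_RULES = (
    "If the question cannot be answered from the schema, do NOT emit SQL.\n"
    "Instead reply exactly: NO_ANSWER | <category> | <one-sentence rationale>."
)

def build_prompt(schema: str, few_shot: list[tuple[str, str]], question: str) -> str:
    """Assemble a schema-aware prompt: schema, NAR rules, then a balanced
    mix of answerable and unanswerable few-shot (question, answer) pairs."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in few_shot)
    return (
        f"Schema:\n{schema}\n\nRules:\n{NO_ANSWER_RULES}\n\n"
        f"{shots}\n\nQ: {question}\nA:"
    )

def parse_response(text: str) -> dict:
    """Classify a model reply as generated SQL or a structured refusal."""
    text = text.strip()
    if text.startswith("NO_ANSWER"):
        _, category, rationale = (p.strip() for p in text.split("|", 2))
        return {"kind": "refusal", "category": category, "rationale": rationale}
    return {"kind": "sql", "sql": text}
```

A UI like the one described can then surface both branches together: show `sql` with its execution result when the model answers, and the `category` plus human-readable `rationale` when it abstains.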