🤖 AI Summary
This work addresses a gap in existing evaluations: decision correctness and explanation auditability are rarely assessed together for Text-to-SQL and RAG systems in regulatory-compliance scenarios. We introduce the first benchmark explicitly designed around policy clauses, with an emphasis on evidence traceability. The benchmark features: (1) a fine-grained clause-binding mechanism that anchors each decision to specific statutory provisions; (2) a minimal-witness trace framework that makes explanations verifiable and auditable; and (3) a Scenario Difficulty Index quantifying the trade-off between explanation quality and retrieval difficulty. Test scenarios are defined in YAML; SQL correctness is validated via result-set equivalence, explanation fidelity via clause-ID matching, and system behavior via multi-dimensional metrics, including retrieval effectiveness, latency, and hallucination rate. Experiments demonstrate that the benchmark substantially improves explanation credibility and evaluation granularity under strict grounding and no-peek constraints.
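Two of the scoring ideas above, result-set equivalence for SQL correctness and clause-ID matching for explanation fidelity, can be sketched roughly as follows. This is an illustrative sketch, not the benchmark's actual evaluator: the function names and the precision/recall formulation are assumptions.

```python
from collections import Counter

def resultsets_equivalent(rows_a, rows_b):
    """Order-insensitive, duplicate-aware comparison of two SQL result sets.

    Each result set is a list of row tuples; two queries are treated as
    equivalent when they return the same multiset of rows.
    """
    return Counter(rows_a) == Counter(rows_b)

def explanation_fidelity(predicted_clause_ids, gold_clause_ids):
    """Clause-ID matching: precision/recall of cited clauses against the gold set."""
    pred, gold = set(predicted_clause_ids), set(gold_clause_ids)
    tp = len(pred & gold)  # clauses the system cited that the gold trace also cites
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```

Under this framing, a hallucinated citation (a clause ID outside the gold set) lowers precision, while a missing governing clause lowers recall.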
📝 Abstract
ScenarioBench is a policy-grounded, trace-aware benchmark for evaluating Text-to-SQL and retrieval-augmented generation in compliance contexts. Each YAML scenario includes a no-peek gold-standard package with the expected decision, a minimal witness trace, the governing clause set, and the canonical SQL, enabling end-to-end scoring of both what a system decides and why. Systems must justify outputs using clause IDs from the same policy canon, making explanations falsifiable and audit-ready. The evaluator reports decision accuracy, trace quality (completeness, correctness, order), retrieval effectiveness, SQL correctness via result-set equivalence, policy coverage, latency, and an explanation-hallucination rate. A normalized Scenario Difficulty Index (SDI) and a budgeted variant (SDI-R) aggregate results while accounting for retrieval difficulty and time. Compared with prior Text-to-SQL or KILT/RAG benchmarks, ScenarioBench ties each decision to clause-level evidence under strict grounding and no-peek rules, shifting gains toward justification quality under explicit time budgets.
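A YAML scenario with the no-peek gold package described above might look like the following. All field names and values here are hypothetical illustrations; the actual schema may differ.

```yaml
# Hypothetical ScenarioBench scenario (field names are illustrative assumptions).
scenario_id: S-042
query: "Can a customer's email address be shared with a marketing partner?"
gold:                       # no-peek gold package, hidden from the system under test
  decision: deny
  witness_trace:            # minimal ordered clause sequence justifying the decision
    - C-7.2
    - C-7.4
  clause_set: [C-7.1, C-7.2, C-7.4]   # governing clauses from the policy canon
  canonical_sql: >
    SELECT clause_id FROM policy_canon
    WHERE topic = 'data_sharing' AND decision = 'deny';
```

Because the gold package is withheld from the system under test, explanations must be reconstructed from the policy canon itself, which is what makes clause-ID citations falsifiable.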