🤖 AI Summary
This work addresses a gap in existing evaluations: decision correctness and explanation auditability are rarely assessed together for Text-to-SQL and RAG systems in regulatory-compliance scenarios. We introduce the first benchmark explicitly designed around policy clauses, with an emphasis on evidence traceability. The benchmark features: (1) a fine-grained clause-binding mechanism that anchors each decision to specific statutory provisions; (2) a minimal-witness trace framework that makes explanations verifiable and auditable; and (3) a Scenario Difficulty Index quantifying the trade-off between explanation quality and retrieval difficulty. Test scenarios are defined in YAML; SQL correctness is validated via result-set equivalence, explanation fidelity via clause-ID matching, and system behavior via multi-dimensional metrics, including retrieval effectiveness, latency, and hallucination rate. Experiments demonstrate that the benchmark substantially improves explanation credibility and evaluation granularity under strict grounding and no-peek constraints.
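Two of the scoring ideas above, result-set equivalence for SQL correctness and clause-ID matching for explanation fidelity, can be sketched roughly as follows. This is an illustrative sketch, not the benchmark's actual evaluator: the function names and the precision/recall formulation are assumptions.

```python
from collections import Counter

def resultsets_equivalent(rows_a, rows_b):
    """Order-insensitive, duplicate-aware comparison of two SQL result sets.

    Each result set is a list of row tuples; two queries are treated as
    equivalent when they return the same multiset of rows.
    """
    return Counter(rows_a) == Counter(rows_b)

def explanation_fidelity(predicted_clause_ids, gold_clause_ids):
    """Clause-ID matching: precision/recall of cited clauses against the gold set."""
    pred, gold = set(predicted_clause_ids), set(gold_clause_ids)
    tp = len(pred & gold)  # clauses the system cited that the gold trace also cites
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```

Under this framing, a hallucinated citation (a clause ID outside the gold set) lowers precision, while a missing governing clause lowers recall.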
📝 Abstract
ScenarioBench is a policy-grounded, trace-aware benchmark for evaluating Text-to-SQL and retrieval-augmented generation in compliance contexts. Each YAML scenario includes a no-peek gold-standard package with the expected decision, a minimal witness trace, the governing clause set, and the canonical SQL, enabling end-to-end scoring of both what a system decides and why. Systems must justify outputs using clause IDs from the same policy canon, making explanations falsifiable and audit-ready. The evaluator reports decision accuracy, trace quality (completeness, correctness, order), retrieval effectiveness, SQL correctness via result-set equivalence, policy coverage, latency, and an explanation-hallucination rate. A normalized Scenario Difficulty Index (SDI) and a budgeted variant (SDI-R) aggregate results while accounting for retrieval difficulty and time. Compared with prior Text-to-SQL or KILT/RAG benchmarks, ScenarioBench ties each decision to clause-level evidence under strict grounding and no-peek rules, shifting gains toward justification quality under explicit time budgets.
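A YAML scenario with the no-peek gold package described above might look like the following. All field names and values here are hypothetical illustrations; the actual schema may differ.

```yaml
# Hypothetical ScenarioBench scenario (field names are illustrative assumptions).
scenario_id: S-042
query: "Can a customer's email address be shared with a marketing partner?"
gold:                       # no-peek gold package, hidden from the system under test
  decision: deny
  witness_trace:            # minimal ordered clause sequence justifying the decision
    - C-7.2
    - C-7.4
  clause_set: [C-7.1, C-7.2, C-7.4]   # governing clauses from the policy canon
  canonical_sql: >
    SELECT clause_id FROM policy_canon
    WHERE topic = 'data_sharing' AND decision = 'deny';
```

Because the gold package is withheld from the system under test, explanations must be reconstructed from the policy canon itself, which is what makes clause-ID citations falsifiable.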