BEAVER: An Enterprise Benchmark for Text-to-SQL

📅 2024-09-03

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Problem: Large language models (LLMs) degrade severely on enterprise-level Text-to-SQL tasks, achieving under 30% execution accuracy, because public benchmarks differ fundamentally from real-world data warehouses. Method: The authors introduce BEAVER, the first evaluation benchmark built from authentic enterprise data warehouse query logs. Their analysis identifies three factors that set enterprise workloads apart from public web-table benchmarks: high schema complexity, deep business logic, and data invisibility (the underlying data is private and unavailable for training). Using the collected natural language and SQL (NL-SQL) pairs, they run a standardized evaluation of mainstream LLMs with prompt engineering and retrieval-augmented generation (RAG). Contribution/Results: The study establishes the first enterprise-oriented Text-to-SQL evaluation framework and shows that multi-table joins and nested aggregations are the main bottlenecks. BEAVER serves as a benchmark and a source of methodological guidance for enterprise NL-to-SQL research.
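Execution accuracy, the metric cited in the summary, is commonly computed by running the predicted and the reference SQL against the same database and comparing the returned rows. The following is a minimal sketch of that idea, assuming SQLite and a toy two-table schema. Neither the schema nor the helper names `run_query` and `execution_accuracy` come from BEAVER, and the benchmark's own scoring rules may differ.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return Counter(rows)  # order-insensitive, duplicate-aware comparison


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose results match."""
    if not pairs:
        return 0.0
    correct = 0
    for predicted_sql, gold_sql in pairs:
        gold = run_query(conn, gold_sql)
        predicted = run_query(conn, predicted_sql)
        if gold is not None and predicted == gold:
            correct += 1
    return correct / len(pairs)


# Toy database (an assumption for illustration, not BEAVER data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dept(id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE emp(id INTEGER PRIMARY KEY, dept_id INTEGER, salary REAL);
INSERT INTO dept VALUES (1, 'eng'), (2, 'ops');
INSERT INTO emp VALUES (1, 1, 100.0), (2, 1, 200.0), (3, 2, 50.0);
""")

gold_sum = ("SELECT name, (SELECT SUM(salary) FROM emp "
            "WHERE dept_id = dept.id) FROM dept")

pairs = [
    # Different SQL text, same result: counted as correct.
    ("SELECT d.name, SUM(e.salary) FROM emp e "
     "JOIN dept d ON e.dept_id = d.id GROUP BY d.name", gold_sum),
    # Wrong aggregate (COUNT instead of SUM): counted as incorrect.
    ("SELECT d.name, COUNT(*) FROM emp e "
     "JOIN dept d ON e.dept_id = d.id GROUP BY d.name", gold_sum),
    # Invalid SQL (unknown table): counted as incorrect.
    ("SELECT nope FROM missing", "SELECT COUNT(*) FROM emp"),
]

score = execution_accuracy(conn, pairs)
print(f"execution accuracy: {score:.2f}")
```

Comparing results rather than SQL strings means two syntactically different queries (here a join with `GROUP BY` versus a correlated subquery) can both be scored as correct.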

📝 Abstract

Existing text-to-SQL benchmarks have largely been constructed from web tables with human-generated question-SQL pairs. LLMs typically show strong results on these benchmarks, leading to a belief that LLMs are effective at text-to-SQL tasks. However, how these results transfer to enterprise settings is unclear because tables in enterprise databases might differ substantially from web tables in structure and content. To address this problem, we introduce a new dataset, BEAVER, the first enterprise text-to-SQL benchmark sourced from real private enterprise data warehouses. This dataset includes natural language queries and their correct SQL statements, which we collected from actual query logs. We then benchmark off-the-shelf LLMs on this dataset. LLMs perform poorly, even when augmented with standard prompt engineering and RAG techniques. We identify three main reasons for the poor performance: (1) schemas of enterprise tables are more complex than the schemas in public data, which makes SQL generation intrinsically harder; (2) business-oriented questions are often more complex, requiring joins over multiple tables, aggregations, and nested queries; (3) public LLMs cannot train on private enterprise data warehouses, which are not publicly accessible, so it is difficult for a model to learn to handle (1) and (2). We believe BEAVER will facilitate future research in building text-to-SQL systems that perform better in enterprise settings.
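The abstract mentions augmenting LLMs with RAG. One common form of this in text-to-SQL is to retrieve only the tables relevant to a question before building the prompt, since a large enterprise schema may not fit in context. The sketch below illustrates that idea with a simple keyword-overlap ranking. It is an assumption for illustration only: the schema, the function names `retrieve_tables` and `build_prompt`, and the scoring rule are hypothetical and are not the retrieval setup used in the paper.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens; 'department_id' -> {'department', 'id'}."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_tables(question, schema, k=2):
    """Rank tables by token overlap with the question and keep the top k."""
    question_tokens = tokens(question)

    def score(item):
        table, columns = item
        return len(question_tokens & tokens(table + " " + " ".join(columns)))

    ranked = sorted(schema.items(), key=score, reverse=True)
    return [table for table, _ in ranked[:k]]


def build_prompt(question, schema, k=2):
    """Assemble a prompt containing only the retrieved tables."""
    lines = ["Given these tables:"]
    for table in retrieve_tables(question, schema, k):
        lines.append(f"{table}({', '.join(schema[table])})")
    lines.append(f"Write a SQL query for: {question}")
    return "\n".join(lines)


# Hypothetical schema, far smaller than a real data warehouse.
schema = {
    "employee": ["id", "name", "department_id", "salary"],
    "department": ["id", "name", "building_id"],
    "building": ["id", "address"],
    "purchase_order": ["id", "vendor", "amount"],
}
question = "What is the total salary per department?"

selected = retrieve_tables(question, schema)
prompt = build_prompt(question, schema)
print(prompt)
```

Keyword overlap is the simplest possible retriever; real systems typically use embedding similarity, and retrieval becomes harder when column names are cryptic or business terms do not appear in the schema at all.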

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Natural Language to SQL Translation

Enterprise Database Environment

Innovation

Methods, ideas, or system contributions that make the work stand out.

BEAVER Dataset

Text-to-SQL Conversion

Enterprise-Level Complexity

🔎 Similar Papers

A Survey on Employing Large Language Models for Text-to-SQL Tasks