Agent Bain vs. Agent McKinsey: A New Text-to-SQL Benchmark for the Business Domain

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing text-to-SQL benchmarks emphasize factual retrieval and fail to support higher-order analytical queries—such as descriptive analysis, causal explanation, temporal forecasting, and strategic recommendation—essential for real-world business decision-making; moreover, state-of-the-art large language models exhibit notable weaknesses in causal reasoning, time-series prediction, and strategic suggestion. Method: We introduce CORGI, the first text-to-SQL benchmark explicitly designed for authentic commercial scenarios, built upon synthetically generated enterprise-scale databases (e.g., modeled after DoorDash and Airbnb architectures) and featuring a curated evaluation suite covering four categories of complex business queries. CORGI pioneers a multi-level assessment framework integrating strategic recommendation and causal inference. Contribution/Results: Experiments show CORGI is 21% more challenging than BIRD, effectively exposing critical capability gaps of current LLMs in business intelligence tasks and establishing a novel benchmark and evaluation paradigm for multi-step decision-making agents.

📝 Abstract
In the business domain, where data-driven decision making is crucial, text-to-SQL is fundamental for easy natural language access to structured data. While recent LLMs have achieved strong performance in code generation, existing text-to-SQL benchmarks remain focused on factual retrieval of past records. We introduce CORGI, a new benchmark specifically designed for real-world business contexts. CORGI is composed of synthetic databases inspired by enterprises such as DoorDash, Airbnb, and Lululemon. It provides questions across four increasingly complex categories of business queries: descriptive, explanatory, predictive, and recommendational. This challenge calls for causal reasoning, temporal forecasting, and strategic recommendation, reflecting multi-level and multi-step agentic intelligence. We find that LLM performance drops on high-level questions, with models struggling to make accurate predictions and offer actionable plans. Based on execution success rate, the CORGI benchmark is about 21% more difficult than the BIRD benchmark. This highlights the gap between the capabilities of popular LLMs and the demands of real-world business intelligence. We release a public dataset, an evaluation framework, and a website for public submissions.
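To make the "execution success rate" metric concrete, here is a minimal sketch of how execution-based scoring of generated SQL can work. Everything below—the schema, the questions, and the candidate SQL—is an invented illustration, not data or code from CORGI:

```python
# Hypothetical sketch of execution-based scoring for text-to-SQL output.
# The schema, questions, and SQL here are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, city TEXT, total REAL, placed_at TEXT);
INSERT INTO orders VALUES
  (1, 'Ithaca',   42.50, '2025-01-03'),
  (2, 'Ithaca',   18.00, '2025-01-04'),
  (3, 'Brooklyn', 65.25, '2025-01-04');
""")

# A descriptive question maps cleanly to plain SQL; a deliberately broken
# query stands in for a model failure on a harder question.
candidates = [
    ("Descriptive: average order total per city",
     "SELECT city, AVG(total) FROM orders GROUP BY city"),
    ("Failure case: query referencing a nonexistent table",
     "SELECT avg_total FROM no_such_table"),
]

successes = 0
for question, sql in candidates:
    try:
        rows = conn.execute(sql).fetchall()
        successes += 1
        print(question, "->", rows)
    except sqlite3.Error as exc:
        print(question, "-> execution failed:", exc)

print(f"Execution success rate: {successes / len(candidates):.0%}")
```

Note that raw execution success only checks that the SQL runs; the paper's higher categories (explanatory, predictive, recommendational) additionally require judging whether the answer is correct and actionable, which a simple harness like this does not capture.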
Problem

Research questions and friction points this paper is trying to address.

Addresses text-to-SQL performance in business intelligence contexts
Introduces CORGI benchmark for complex business queries
Evaluates LLMs on predictive and strategic reasoning challenges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CORGI benchmark for business text-to-SQL
Synthetic databases simulate real enterprise scenarios
Tests predictive and recommendational query capabilities
Authors

Yue Li (Cornell University)
Ran Tao (Cornell University)
Derek Hommel (Gena AI)
Yusuf Denizay Donder (Gena AI)
Sungyong Chang (Cornell University)
David Mimno (Associate Professor, Cornell University — Machine Learning, Text Mining, Topic Modeling, Digital Humanities)
Unso Eun Seo Jo (Cornell University, Gena AI)