🤖 AI Summary
Existing text-to-SQL benchmarks emphasize factual retrieval and fail to support the higher-order analytical queries (descriptive analysis, causal explanation, temporal forecasting, and strategic recommendation) that real-world business decision-making requires; state-of-the-art large language models likewise show notable weaknesses in causal reasoning, time-series prediction, and strategic recommendation. Method: We introduce CORGI, the first text-to-SQL benchmark explicitly designed for authentic commercial scenarios. It is built on synthetically generated enterprise-scale databases (e.g., modeled after the DoorDash and Airbnb architectures) and provides a curated evaluation suite covering four categories of complex business queries, with a multi-level assessment framework that integrates causal inference, temporal forecasting, and strategic recommendation. Contribution/Results: Experiments show that, by execution success rate, CORGI is about 21% more difficult than BIRD, exposing critical capability gaps of current LLMs in business intelligence and establishing a new benchmark and evaluation paradigm for multi-step decision-making agents.
📝 Abstract
In the business domain, where data-driven decision making is crucial, text-to-SQL is fundamental for easy natural-language access to structured data. While recent LLMs have achieved strong performance on code generation, existing text-to-SQL benchmarks remain focused on factual retrieval of past records. We introduce CORGI, a new benchmark specifically designed for real-world business contexts. CORGI is composed of synthetic databases inspired by enterprises such as DoorDash, Airbnb, and Lululemon. It provides questions across four increasingly complex categories of business queries: descriptive, explanatory, predictive, and recommendational. These questions call for causal reasoning, temporal forecasting, and strategic recommendation, reflecting multi-level and multi-step agentic intelligence. We find that LLM performance drops on higher-level questions: models struggle to make accurate predictions and to offer actionable plans. Based on execution success rate, CORGI is about 21% more difficult than the BIRD benchmark, highlighting the gap between what popular LLMs can do and what real-world business intelligence demands. We release a public dataset, an evaluation framework, and a website for public submissions.
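To make the headline metric concrete, the sketch below illustrates what an execution success rate over model-generated SQL looks like. The schema, queries, and helper function are hypothetical illustrations (not CORGI's actual databases or harness); the only assumption taken from the abstract is that a query counts as successful when it executes without error.

```python
import sqlite3

# Toy schema loosely inspired by a delivery-platform database
# (hypothetical; not CORGI's actual schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, city TEXT, total REAL, placed_at TEXT);
INSERT INTO orders VALUES
  (1, 'Austin', 25.0, '2024-01-05'),
  (2, 'Austin', 40.0, '2024-01-06'),
  (3, 'Boston', 15.0, '2024-01-05');
""")

# Candidate SQL as a model might emit it; the second query references a
# nonexistent column, so it fails at execution time.
candidates = [
    "SELECT city, AVG(total) FROM orders GROUP BY city",      # descriptive query
    "SELECT region, SUM(total) FROM orders GROUP BY region",  # broken: no 'region' column
]

def execution_success_rate(conn, queries):
    """Fraction of queries that execute without raising a database error."""
    ok = 0
    for q in queries:
        try:
            conn.execute(q).fetchall()
            ok += 1
        except sqlite3.Error:
            pass
    return ok / len(queries)

rate = execution_success_rate(conn, candidates)
print(rate)  # 0.5
```

Execution success is a deliberately coarse signal: it rewards syntactically and schematically valid SQL, which is exactly where the abstract reports CORGI being roughly 21% harder than BIRD.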