Falcon: A Comprehensive Chinese Text-to-SQL Benchmark for Enterprise-Grade Evaluation

📅 2025-10-22
🤖 AI Summary
This work addresses the accuracy bottleneck of Chinese Text-to-SQL in enterprise applications. It focuses on two challenges: understanding Chinese semantics and generating SQL in enterprise dialects (e.g., MaxCompute/Hive). Both are hardest on large, non-canonical schemas with hundreds of tables, ambiguous column names, implicit foreign keys, and domain-specific synonyms, where colloquial questions must be mapped to precise SQL through aggregation and grouping inference, time-window handling, NULL semantics, and nested or windowed subqueries. To this end, the paper introduces the first cross-domain benchmark tailored to Chinese semantics and enterprise SQL dialects. It covers 28 real-world databases and 600 questions (77% requiring multi-table reasoning) and is the first to jointly annotate SQL computational features and Chinese semantic intent. It provides standardized schemas, reproducible templates, and an end-to-end automated evaluation pipeline. State-of-the-art LLMs reach at most 50% accuracy on this benchmark, exposing critical deficiencies in multi-table join reasoning, aggregation inference, and temporal expression handling.

📝 Abstract
We introduce Falcon, a cross-domain Chinese text-to-SQL benchmark grounded in an enterprise-compatible dialect (MaxCompute/Hive). It contains 600 Chinese questions over 28 databases; 77% require multi-table reasoning and over half touch more than four tables. Each example is annotated with its SQL-computation features and Chinese semantics. For evaluation, we release a robust execution comparator and an automated evaluation pipeline, under which all current state-of-the-art large-scale models (including DeepSeek) achieve accuracies of at most 50%. Major errors originate from two sources: (1) schema linking in large enterprise landscapes, where hundreds of tables, denormalized fields, ambiguous column names, implicit foreign-key relations, and domain-specific synonyms make correct join and column selection difficult; and (2) mapping concise, colloquial Chinese into the exact operators and predicates required for analytics, e.g., choosing the correct aggregation and group-by keys, expressing time windows and granularities, applying unit conversions, handling NULLs and data-quality rules, and formulating nested or windowed subqueries. Falcon therefore targets Chinese-specific semantics (abbreviations, business jargon, fuzzy entity references) and enterprise dialects, and provides a reproducible middle ground before full production deployment by using realistic enterprise schemas, query templates, an execution comparator, and an automated evaluation pipeline for end-to-end validation.
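
The abstract does not describe how the execution comparator works, so the following is only a minimal sketch of what an execution-based comparison might look like, not Falcon's implementation. It assumes both queries have already been executed and their rows fetched as tuples; the names `results_match` and `_normalize_cell`, the rounding precision, and the choice to ignore row order by default are all hypothetical.

```python
from collections import Counter
from decimal import Decimal


def _normalize_cell(value, ndigits=4):
    """Map one result cell to a hashable, type-tagged form (hypothetical helper)."""
    if value is None:
        # SQL NULL only matches another NULL.
        return ("null",)
    if isinstance(value, bool):
        return ("bool", value)
    if isinstance(value, (int, float, Decimal)):
        # Round so that 1 and 1.0, or 0.1 + 0.2 and 0.3, compare equal.
        return ("num", round(float(value), ndigits))
    # Everything else is compared as a whitespace-trimmed string.
    return ("str", str(value).strip())


def results_match(predicted_rows, gold_rows, ordered=False, ndigits=4):
    """Return True if two result sets are equal after normalization.

    With ordered=False the rows are compared as multisets, which suits
    queries that have no ORDER BY clause; duplicates still count.
    """
    pred = [tuple(_normalize_cell(c, ndigits) for c in row) for row in predicted_rows]
    gold = [tuple(_normalize_cell(c, ndigits) for c in row) for row in gold_rows]
    if ordered:
        return pred == gold
    return Counter(pred) == Counter(gold)
```

Comparing rows as multisets rather than sets keeps duplicate rows significant, which matters for queries that omit `DISTINCT`. A real comparator would likely also need to handle column reordering and dialect-specific types, which this sketch leaves out.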

Problem

Research questions and friction points this paper is trying to address.

Evaluating Chinese text-to-SQL models on enterprise schemas
Addressing schema linking challenges in large-scale databases
Mapping colloquial Chinese queries to precise SQL operations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Enterprise-compatible dialect benchmark for Chinese text-to-SQL
Automated evaluation pipeline with execution comparator
Addresses schema linking and colloquial Chinese mapping challenges

Authors

Wenzhen Luo, Ant Group
Wei Guan, Ant Group
Yifan Yao, Drexel University
Yimin Pan, Ant Group
Feng Wang, Ant Group
Zhipeng Yu, Ant Group
Zhe Wen, Ant Group
Liang Chen, Ant Group
Yihong Zhuang, Ant Group