🤖 AI Summary
Financial Text-to-SQL faces steep challenges, including highly complex schemas, domain-specific terminology, and high error costs, yet it lacks large-scale domain-specific benchmarks and tailored evaluation metrics. To close this gap, we introduce FINCH, the first comprehensive financial-domain benchmark, comprising 292 tables and 75,725 high-quality question-SQL pairs. We also propose the FINCH Score, a semantics-aware evaluation metric that precisely quantifies correctness along the SQL dimensions most critical in finance: numerical expressions, temporal logic, and conditional reasoning. Benchmarking reasoning models and language models of varying scales with in-context learning and systematic evaluation, we identify persistent performance bottlenecks across diverse model families in financial settings, establishing a reproducible, comparable foundation for future domain-specific Text-to-SQL research.
📝 Abstract
Text-to-SQL, the task of translating natural language questions into SQL queries, has long been a central challenge in NLP. While progress has been significant, applying it to the financial domain remains especially difficult due to complex schemas, domain-specific terminology, and the high stakes of error. Despite this, no dedicated large-scale financial dataset exists to advance research, leaving a critical gap. To address it, we introduce a curated financial dataset (FINCH) comprising 292 tables and 75,725 natural language-SQL pairs, enabling both fine-tuning and rigorous evaluation. Building on this resource, we benchmark reasoning models and language models of varying scales, providing a systematic analysis of their strengths and limitations on financial Text-to-SQL tasks. Finally, we propose a finance-oriented evaluation metric (FINCH Score) that captures nuances overlooked by existing measures, offering a more faithful assessment of model performance.