IndicDB -- Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages

📅 2026-04-15

🤖 AI Summary
Existing Text-to-SQL benchmarks are largely confined to Western contexts and simplified schemas, which makes them a poor fit for evaluating multilingual settings beyond English. To bridge this gap, the authors introduce IndicDB, a high-complexity, multilingual Text-to-SQL benchmark focused on Indian languages. Built from Indian open-data platforms (NDAP and the India Data Portal), IndicDB comprises 20 relational databases with 237 tables and join depths of up to six, and covers English, Hindi, and five other Indic languages. An iterative three-agent framework (Architect, Auditor, Refiner) turns the denormalized source data into relational schemas, and a value-aware, difficulty-calibrated, join-enforced pipeline generates 15,617 tasks. Evaluation on state-of-the-art large language models reveals an "Indic Gap": performance on Indic languages trails English by 9.00% on average, which the authors attribute to harder schema linking, increased structural ambiguity, and limited external knowledge.

📝 Abstract
While Large Language Models (LLMs) have significantly advanced Text-to-SQL performance, existing benchmarks predominantly focus on Western contexts and simplified schemas, leaving a gap in real-world, non-Western applications. We present IndicDB, a multilingual Text-to-SQL benchmark for evaluating cross-lingual semantic parsing across diverse Indic languages. The relational schemas are sourced from open-data platforms, including the National Data and Analytics Platform (NDAP) and the India Data Portal (IDP), ensuring realistic administrative data complexity. IndicDB comprises 20 databases with 237 tables in total. To convert denormalized government data into rich relational structures, we employ an iterative three-agent framework (Architect, Auditor, Refiner) to ensure structural rigor and high relational density (11.85 tables per database; join depths up to six). Our pipeline is value-aware, difficulty-calibrated, and join-enforced, generating 15,617 tasks across English, Hindi, and five other Indic languages. We evaluate the cross-lingual semantic parsing performance of state-of-the-art models (DeepSeek v3.2, MiniMax 2.7, LLaMA 3.3, Qwen3) across seven linguistic variants. Results show a 9.00% performance drop from English to Indic languages, revealing an "Indic Gap" driven by harder schema linking, increased structural ambiguity, and limited external knowledge. IndicDB serves as a rigorous benchmark for multilingual Text-to-SQL. Code and data: https://anonymous.4open.science/r/multilingualText2Sql-Indic--DDCC/
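
To make the "Indic Gap" concrete, the following is a minimal sketch of how a per-language score and an English-to-Indic gap could be computed. It rests on several assumptions that are not stated in the abstract: that scoring is by execution match (predicted and gold SQL return the same rows), that the gap is an absolute difference in percentage points, and that languages are tagged with codes such as `en` and `hi`. The toy `state`/`district` schema, the helper names, and the sample queries are all hypothetical and are not taken from IndicDB.

```python
import sqlite3


def execution_match(conn, gold_sql, pred_sql):
    """Return True when both queries yield the same multiset of rows."""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute counts as a miss
    gold_rows = conn.execute(gold_sql).fetchall()
    # Compare order-insensitively; repr() keeps mixed-type rows sortable.
    return sorted(map(repr, gold_rows)) == sorted(map(repr, pred_rows))


def accuracy_by_language(conn, tasks):
    """tasks: iterable of (language, gold_sql, pred_sql). Returns % per language."""
    flags = {}
    for lang, gold_sql, pred_sql in tasks:
        flags.setdefault(lang, []).append(execution_match(conn, gold_sql, pred_sql))
    return {lang: 100.0 * sum(f) / len(f) for lang, f in flags.items()}


def indic_gap(acc, english="en"):
    """English accuracy minus mean accuracy over the other languages (assumed definition)."""
    others = [score for lang, score in acc.items() if lang != english]
    return acc[english] - sum(others) / len(others)


# Hypothetical two-table schema standing in for a benchmark database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE state (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE district (
    id INTEGER PRIMARY KEY, name TEXT, state_id INTEGER REFERENCES state(id)
);
INSERT INTO state VALUES (1, 'Kerala'), (2, 'Punjab');
INSERT INTO district VALUES (1, 'Kollam', 1), (2, 'Idukki', 1), (3, 'Moga', 2);
""")

GOLD = (
    "SELECT d.name FROM district d "
    "JOIN state s ON d.state_id = s.id WHERE s.name = 'Kerala'"
)
tasks = [
    ("en", GOLD, "SELECT name FROM district WHERE state_id = 1"),  # same rows
    ("hi", GOLD, GOLD),                                            # same rows
    ("hi", GOLD, "SELECT name FROM district"),                     # filter dropped
]

acc = accuracy_by_language(conn, tasks)
gap = indic_gap(acc)
print(acc, gap)
```

Execution match is used here because it tolerates predictions that differ textually from the gold query but return the same result; a benchmark may instead use exact-match or another metric, in which case only `execution_match` would need to change.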
Problem

Research questions and friction points this paper is trying to address.

Text-to-SQL
multilingual
Indic languages
semantic parsing
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual Text-to-SQL
Indic languages
relational schema construction
three-agent framework
cross-lingual semantic parsing