Benchmarking Text-to-Python against Text-to-SQL: The Impact of Explicit Logic and Ambiguity

📅 2026-01-22
🤖 AI Summary
The reliability of Text-to-Python for data retrieval tasks remains unclear, particularly due to ambiguities in user intent and the need for explicit logical specifications. This work introduces BIRD-Python, the first unified cross-paradigm benchmark that systematically compares Text-to-Python with Text-to-SQL through rigorous data curation and execution-based semantic alignment. The study reveals that performance gaps stem primarily from missing domain context rather than deficiencies in code generation capability. To address this, the authors propose a Logic Completion Framework (LCF) that translates ambiguous natural language instructions into executable logical forms. Experimental results demonstrate that integrating LCF enables Text-to-Python to achieve accuracy comparable to Text-to-SQL on data retrieval tasks, highlighting Text-to-Python's dependence on implicit domain knowledge and offering an effective mitigation strategy.
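To make the "logic completion" idea concrete, the following is a hypothetical sketch (not the authors' LCF implementation) of how an ambiguous instruction might be resolved into an explicit logical form before code generation; the field names, defaults, and the `to_pandas_snippet` helper are all illustrative assumptions.

```python
# Hypothetical sketch of logic completion: an ambiguous request is
# expanded into an explicit specification, which can then be rendered
# as the procedural logic that Python/Pandas requires.
ambiguous = "find the top students"

# Latent domain knowledge an LCF-style step would have to supply
# (every value below is an assumption, not part of the request):
completed = {
    "metric": "gpa",        # "top" by what measure?
    "limit": 5,             # "top" how many?
    "ties": "include",      # keep ties at the cutoff?
    "nulls": "exclude",     # drop rows with a missing metric?
}

def to_pandas_snippet(spec):
    # Render the explicit spec as Pandas code; nlargest() already sorts
    # descending, and keep='all' retains ties at the cutoff.
    return (
        f"df.dropna(subset=['{spec['metric']}'])"
        f".nlargest({spec['limit']}, '{spec['metric']}', keep='all')"
    )

print(to_pandas_snippet(completed))
# → df.dropna(subset=['gpa']).nlargest(5, 'gpa', keep='all')
```

Each key in the completed spec corresponds to a decision a SQL engine would often make implicitly but that generated Python code must state explicitly.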

📝 Abstract
While Text-to-SQL remains the dominant approach for database interaction, real-world analytics increasingly require the flexibility of general-purpose programming languages such as Python or Pandas to manage file-based data and complex analytical workflows. Despite this growing need, the reliability of Text-to-Python in core data retrieval remains underexplored relative to the mature SQL ecosystem. To address this gap, we introduce BIRD-Python, a benchmark designed for cross-paradigm evaluation. We systematically refined the original dataset to reduce annotation noise and align execution semantics, thereby establishing a consistent and standardized baseline for comparison. Our analysis reveals a fundamental paradigmatic divergence: whereas SQL leverages implicit DBMS behaviors through its declarative structure, Python requires explicit procedural logic, making it highly sensitive to underspecified user intent. To mitigate this challenge, we propose the Logic Completion Framework (LCF), which resolves ambiguity by incorporating latent domain knowledge into the generation process. Experimental results show that (1) performance differences primarily stem from missing domain context rather than inherent limitations in code generation, and (2) when these gaps are addressed, Text-to-Python achieves performance parity with Text-to-SQL. These findings establish Python as a viable foundation for analytical agents, provided that systems effectively ground ambiguous natural language inputs in executable logical specifications. Resources are available at https://anonymous.4open.science/r/Bird-Python-43B7/.
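The "implicit DBMS behaviors vs. explicit procedural logic" divergence the abstract describes can be seen in a toy example (mine, not from the paper): SQL's AVG() silently skips NULLs, while the equivalent Python must spell out that handling or crash.

```python
# Illustrative only: how declarative SQL hides a decision that
# procedural Python must make explicitly.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (val REAL)")
conn.executemany("INSERT INTO scores VALUES (?)", [(10,), (None,), (20,)])

# SQL: AVG() ignores NULL rows implicitly -- standard DBMS semantics.
sql_avg = conn.execute("SELECT AVG(val) FROM scores").fetchone()[0]
print(sql_avg)  # 15.0

# Python: the same intent needs explicit procedural logic; a naive
# sum(rows) / len(rows) would raise TypeError on None and use the
# wrong denominator even if it didn't.
rows = [r[0] for r in conn.execute("SELECT val FROM scores")]
kept = [v for v in rows if v is not None]  # explicit NULL handling
py_avg = sum(kept) / len(kept)
print(py_avg)  # 15.0
```

When the natural-language request never mentions missing values, a SQL generator inherits a sensible default from the engine, whereas a Python generator must be told (or infer) the filtering step, which is exactly the underspecification the paper attributes the performance gap to.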
Problem

Research questions and friction points this paper is trying to address.

Text-to-Python
Text-to-SQL
ambiguity
explicit logic
domain context
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-to-Python
Logic Completion Framework
BIRD-Python
cross-paradigm benchmarking
domain knowledge grounding