🤖 AI Summary
To address the challenge non-expert users face in efficiently querying and analyzing large tabular data, this paper proposes an end-to-end framework that translates natural language queries into executable query plans using large language models (LLMs). Unlike conventional SQL generation approaches, the method employs iterative semantic parsing to map natural language into heterogeneous operation sequences, including statistical and machine learning primitives (e.g., PCA, anomaly detection), and executes them directly on the data outside traditional databases. Because only the plan, not the full dataset, passes through the model, this circumvents LLM context-length limitations and the overhead of full-data loading. Experiments on standard benchmarks and large-scale scientific tabular datasets demonstrate substantial improvements in task completion rate and execution efficiency for complex analytical tasks. The framework supports flexible, scalable data analysis beyond the expressive capacity of SQL, offering a novel pathway for the NL2Data analysis paradigm.
📝 Abstract
Efficient querying and analysis of large tabular datasets remain significant challenges, especially for users without expertise in programming languages like SQL. Text-to-SQL approaches have shown promising performance on benchmark data; however, they inherit SQL's drawbacks, including inefficiency with large datasets and limited support for complex data analyses beyond basic querying. We propose a novel framework that transforms natural language queries into query plans. Our solution is implemented outside traditional databases, allowing us to support classical SQL commands while avoiding SQL's inherent limitations. Additionally, we enable complex analytical functions, such as principal component analysis and anomaly detection, providing greater flexibility and extensibility than traditional SQL capabilities. We leverage LLMs to iteratively interpret queries and construct operation sequences, addressing computational complexity by incrementally building solutions. By executing operations directly on the data, we overcome context length limitations without requiring the entire dataset to be processed by the model. We validate our framework through experiments on both standard databases and large scientific tables, demonstrating its effectiveness in handling extensive datasets and performing sophisticated data analyses.
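The pipeline the abstract describes, an LLM emitting an operation sequence that is then executed directly on the data, might be sketched as follows. The plan format, the operation names, and the `execute_plan` helper are illustrative assumptions for this sketch, not the paper's actual interface; only the short plan, never the table itself, would need to pass through the model.

```python
# Minimal sketch: execute a hypothetical LLM-produced operation plan
# on a pandas DataFrame, outside the model's context window.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

def execute_plan(df: pd.DataFrame, plan: list) -> pd.DataFrame:
    """Run each operation in sequence on the data itself."""
    for step in plan:
        op = step["op"]
        if op == "filter":      # classic SQL-style selection
            df = df.query(step["expr"])
        elif op == "pca":       # analytical primitive beyond plain SQL
            k = step["k"]
            comps = PCA(n_components=k).fit_transform(df)
            df = pd.DataFrame(comps, columns=[f"pc{i}" for i in range(k)])
        elif op == "anomaly":   # flag outliers with an isolation forest
            labels = IsolationForest(random_state=0).fit_predict(df)
            df = df.assign(anomaly=(labels == -1))
        else:
            raise ValueError(f"unknown operation: {op}")
    return df

# A plan an LLM might emit for a query like
# "find anomalies among the first two principal components".
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
plan = [
    {"op": "filter", "expr": "a > -1"},
    {"op": "pca", "k": 2},
    {"op": "anomaly"},
]
result = execute_plan(data, plan)
print(list(result.columns))  # pc0, pc1, anomaly
```

Building the sequence incrementally, as the paper proposes, would let the model refine or extend such a plan step by step rather than commit to a single monolithic query.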