🤖 AI Summary
Manually annotating SQL logs with the natural language questions they answer, a prerequisite for building domain-specific text-to-SQL benchmarks over private enterprise data warehouses, is costly and inefficient.
Method: This paper proposes a lightweight human-in-the-loop annotation framework that uses retrieval-augmented generation (RAG) and large language models to automatically generate candidate natural language utterances, which experts then curate through filtering, ranking, and fine-grained editing, minimizing reliance on manual authoring.
Contribution/Results: The core contribution is a collaborative workflow that keeps generation controllable and semantically consistent with the target domain, balancing annotation efficiency against output quality. Experiments demonstrate a 62% reduction in annotation time and an 18% improvement in question-SQL alignment accuracy. The resulting benchmark enables more reliable evaluation of domain-specific text-to-SQL models. The system is open source and deployed as an online service.
📝 Abstract
Large language models (LLMs) have been successfully applied to many tasks, including text-to-SQL generation. However, much of this work has focused on publicly available datasets, such as Fiben, Spider, and Bird. Our earlier work showed that LLMs are much less effective in querying large private enterprise data warehouses and released Beaver, the first private enterprise text-to-SQL benchmark. To create Beaver, we leveraged SQL logs, which are often readily available. However, manually annotating these logs to identify which natural language questions they answer is a daunting task. Asking database administrators, who are highly trained experts, to take on additional work to construct and validate corresponding natural language utterances is not only challenging but also quite costly. To address this challenge, we introduce BenchPress, a human-in-the-loop system designed to accelerate the creation of domain-specific text-to-SQL benchmarks. Given a SQL query, BenchPress uses retrieval-augmented generation (RAG) and LLMs to propose multiple natural language descriptions. Human experts then select, rank, or edit these drafts to ensure accuracy and domain alignment. We evaluated BenchPress on annotated enterprise SQL logs, demonstrating that LLM-assisted annotation drastically reduces the time and effort required to create high-quality benchmarks. Our results show that combining human verification with LLM-generated suggestions enhances annotation accuracy, benchmark reliability, and model evaluation robustness. By streamlining the creation of custom benchmarks, BenchPress offers researchers and practitioners a mechanism for assessing text-to-SQL models on a given domain-specific workload. BenchPress is freely available via our public GitHub repository at https://github.com/fabian-wenz/enterprise-txt2sql and is also accessible on our website at http://dsg-mcgraw.csail.mit.edu:5000.
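To make the generate-then-curate workflow described above concrete, the sketch below mimics it in miniature: build a prompt from a logged SQL query plus retrieved schema context, ask an LLM for several candidate questions, and let a human pick or edit one. This is an illustration under assumptions, not BenchPress's actual code or prompts; the names (`AnnotationTask`, `build_prompt`, `curate`) and the stand-in `llm` callable are hypothetical.

```python
# Minimal sketch of a RAG-assisted, human-in-the-loop annotation loop.
# Not the BenchPress implementation; all names and prompts are illustrative.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AnnotationTask:
    sql: str                   # query taken from the SQL log
    schema_context: List[str]  # retrieved DDL / column descriptions (the RAG part)

def build_prompt(task: AnnotationTask, k: int) -> str:
    """Assemble a prompt asking for k candidate questions for one SQL query."""
    context = "\n".join(task.schema_context)
    return (
        f"Relevant schema:\n{context}\n\n"
        f"SQL query:\n{task.sql}\n\n"
        f"Write {k} distinct natural-language questions that this query answers, "
        f"one per line, using the domain's terminology."
    )

def generate_candidates(task: AnnotationTask,
                        llm: Callable[[str], str],
                        k: int = 5) -> List[str]:
    """Call any text-in/text-out LLM function and split its reply into drafts."""
    reply = llm(build_prompt(task, k))
    return [line.strip("-• ").strip() for line in reply.splitlines() if line.strip()][:k]

def curate(candidates: List[str]) -> str:
    """Simplest possible expert step: pick a draft by number or type an edited question."""
    for i, candidate in enumerate(candidates, 1):
        print(f"[{i}] {candidate}")
    choice = input("Pick a number, or type an edited question: ").strip()
    return candidates[int(choice) - 1] if choice.isdigit() else choice

if __name__ == "__main__":
    task = AnnotationTask(
        sql="SELECT dept, AVG(salary) FROM employees GROUP BY dept",
        schema_context=["employees(id, name, dept, salary)"],
    )
    # Stand-in for a real LLM client, so the sketch runs without API access.
    fake_llm = lambda prompt: ("What is the average salary per department?\n"
                               "Show mean salary broken down by department.")
    print(curate(generate_candidates(task, fake_llm)))
```

In a real deployment the `llm` callable would wrap whatever model the team uses, and the curation step would be a web UI rather than a terminal prompt; the point of the sketch is only the division of labor, with the model drafting and the expert verifying.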