🤖 AI Summary
This paper addresses the problem of precisely routing natural language queries in enterprise multi-source database environments: automatically identifying the most relevant database amid challenges including large-scale databases, domain similarity, scarce metadata, and ambiguous or complex queries. We formally define the database-level query routing task and identify four key difficulty factors: database scale, domain overlap, entity interpretability, and query ambiguity. To tackle this, we propose a dual-path baseline framework: (1) zero-shot routing using open-source large language models (LLMs), and (2) fine-tuned semantic embeddings leveraging database schemas and synthetically generated NL-to-SQL data. Experiments show that LLMs excel on short queries but suffer from context-length limitations, while fine-tuned embeddings significantly improve accuracy when domain-specific training data is sufficient. Quantitative analysis confirms the detrimental impact of all four identified difficulty factors on routing performance. To our knowledge, this is the first systematic modeling of, and solution for, database-level NL query routing.
📝 Abstract
Enterprise-level data is often distributed across multiple sources, and identifying the correct set of data sources with information relevant to a knowledge request is a fundamental challenge. In this work, we define the novel task of routing an end-user query to the appropriate data source, where the data sources are databases. We synthesize datasets by extending existing datasets designed for NL-to-SQL semantic parsing. We create baselines on these datasets using open-source LLMs as well as embedding-based approaches, with both pre-trained embeddings and task-specific embeddings fine-tuned on the training data. With these baselines we demonstrate that open-source LLMs outperform the embedding-based approaches but suffer from token-length limitations. Embedding-based approaches benefit from task-specific fine-tuning, more so when database-specific questions are available for training. We further find that the task becomes more difficult (i) as the number of data sources increases, (ii) when data sources are closer in domain, (iii) when databases lack the external domain knowledge required to interpret their entities, and (iv) with ambiguous and complex queries that require more fine-grained understanding of the data sources or logical reasoning to route to an appropriate source. This calls for more sophisticated solutions to better address the task.
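To make the embedding-based routing path concrete, here is a minimal sketch of the idea: embed each database's schema text, embed the incoming query, and route to the database with the highest similarity. This is not the paper's implementation — a real system would use a (possibly fine-tuned) sentence encoder, whereas a toy bag-of-words "embedding" stands in here, and the schema strings and database names are invented for illustration.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy 'embedding': bag-of-words counts (placeholder for a learned encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def route(query: str, schemas: dict) -> str:
    """Route the query to the database whose schema text is most similar."""
    q = embed(query)
    return max(schemas, key=lambda db: cosine(q, embed(schemas[db])))

# Hypothetical schema descriptions for two databases.
schemas = {
    "flights_db": "flights airline airport departure arrival delay",
    "movies_db": "movies actor director genre rating box office",
}

print(route("which airline has the most delayed flights", schemas))  # flights_db
```

Note how the overlapping-domain difficulty factor shows up directly in this formulation: two databases with similar schema vocabulary produce near-identical similarity scores, which is why the paper's fine-tuned, data-aware embeddings help.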