🤖 AI Summary
Existing Text-to-SQL methods struggle with unstructured data and semantically ambiguous queries, while existing VectorSQL implementations rely heavily on manual crafting and lack tailored evaluation frameworks, which keeps their theoretical potential from translating into practical deployment.
Method: We propose Text2VectorSQL, a novel framework that unifies Text-to-SQL generation with vector search to enable semantic filtering, multimodal matching, and retrieval acceleration. For evaluation, we build vector indexes on appropriate columns, extend user queries with semantic search, and annotate ground truths through an automatic pipeline with expert review. Dedicated Text2VectorSQL models are trained on synthetic data.
Contribution/Results: The dedicated Text2VectorSQL models show significant performance improvements over baseline methods. The work establishes the foundation for the Text2VectorSQL task and paves the way for more versatile and intuitive natural language database interfaces.
📝 Abstract
While Text-to-SQL enables natural language interaction with structured databases, its effectiveness diminishes with unstructured data or ambiguous queries due to rigid syntax and limited expressiveness. Concurrently, vector search has emerged as a powerful paradigm for semantic retrieval, particularly for unstructured data. However, existing VectorSQL implementations still rely heavily on manual crafting and lack tailored evaluation frameworks, leaving a significant gap between theoretical potential and practical deployment. To bridge these complementary paradigms, we introduce Text2VectorSQL, a novel framework unifying Text-to-SQL and vector search to overcome expressiveness constraints and support more diverse and holistic natural language queries. Specifically, Text2VectorSQL enables semantic filtering, multi-modal matching, and retrieval acceleration. For evaluation, we build vector indexes on appropriate columns, extend user queries with semantic search, and annotate ground truths via an automatic pipeline with expert review. Furthermore, we develop dedicated Text2VectorSQL models with synthetic data, demonstrating significant performance improvements over baseline methods. Our work establishes the foundation for the Text2VectorSQL task, paving the way for more versatile and intuitive database interfaces. The repository will be publicly available at https://github.com/Open-DataFlow/Text2VectorSQL.
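To make the idea of semantic filtering inside SQL concrete, the following is a minimal sketch of a query that combines an ordinary structured predicate with a vector-distance ranking. It is an illustration only and does not reproduce the VectorSQL dialect or models of the paper, which the abstract does not show. The `papers` table, the `cosine_distance` SQL function, the `search` helper, and the hand-written 3-dimensional "embeddings" are all hypothetical; a real system would use a vector-capable database and a learned embedding model.

```python
import json
import math
import sqlite3


def cosine_distance(a_json: str, b_json: str) -> float:
    """Return 1 - cosine similarity for two JSON-encoded vectors."""
    a, b = json.loads(a_json), json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


# Hypothetical toy rows: (id, title, year, embedding).
# The vectors are hand-written placeholders, not real model output.
ROWS = [
    (1, "Learned index structures", 2018, [0.9, 0.1, 0.0]),
    (2, "Approximate nearest neighbor search", 2021, [0.8, 0.2, 0.1]),
    (3, "Query optimization with cost models", 2022, [0.1, 0.9, 0.2]),
    (4, "Dense retrieval for question answering", 2023, [0.7, 0.1, 0.6]),
]


def search(query_vec, min_year, k):
    """Filter by year (structured) and rank by vector distance (semantic)."""
    conn = sqlite3.connect(":memory:")
    # Register the Python function so it can be called from SQL.
    conn.create_function("cosine_distance", 2, cosine_distance)
    conn.execute(
        "CREATE TABLE papers ("
        "id INTEGER PRIMARY KEY, title TEXT, year INTEGER, embedding TEXT)"
    )
    conn.executemany(
        "INSERT INTO papers VALUES (?, ?, ?, ?)",
        [(i, t, y, json.dumps(v)) for i, t, y, v in ROWS],
    )
    sql = """
        SELECT title
        FROM papers
        WHERE year >= ?
        ORDER BY cosine_distance(embedding, ?)
        LIMIT ?
    """
    params = (min_year, json.dumps(query_vec), k)
    titles = [row[0] for row in conn.execute(sql, params)]
    conn.close()
    return titles


if __name__ == "__main__":
    print(search([1.0, 0.0, 0.0], min_year=2020, k=2))
```

This sketch scans every row and computes distances in Python, so it shows only the query shape. The retrieval acceleration mentioned in the abstract would come from a vector index, which is not modeled here.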