🤖 AI Summary
Text-to-SQL systems for enterprise data lakes aim to let users such as product managers, engineers, and operations teams self-serve data insights through natural language. To address three enterprise challenges (dynamic schemas, high query complexity, and deep contextual dependencies), this work proposes: (1) a knowledge graph that indexes database metadata, historical query logs, wikis, and code, with clustering to identify the relevant tables for each team or product area; (2) a Text-to-SQL agent that retrieves and ranks context from the knowledge graph, writes a query, and automatically corrects hallucinations and syntax errors; and (3) an interactive chatbot that supports intents ranging from data discovery to query writing to debugging, and displays responses in rich UI elements to encourage follow-up chats. Deployed internally at LinkedIn, the chatbot has over 300 weekly users. In expert review on an internal benchmark set, 53% of its responses were correct or close to correct. Ablation studies identify the most important knowledge graph and modeling components.
📝 Abstract
The introduction of large language models has brought rapid progress on Text-to-SQL benchmarks, but it is not yet easy to build a working enterprise solution. In this paper, we present insights from building an internal chatbot that enables LinkedIn's product managers, engineers, and operations teams to self-serve data insights from a large, dynamic data lake. Our approach features three components. First, we construct a knowledge graph that captures up-to-date semantics by indexing database metadata, historical query logs, wikis, and code. We apply clustering to identify relevant tables for each team or product area. Second, we build a Text-to-SQL agent that retrieves and ranks context from the knowledge graph, writes a query, and automatically corrects hallucinations and syntax errors. Third, we build an interactive chatbot that supports various user intents, from data discovery to query writing to debugging, and displays responses in rich UI elements to encourage follow-up chats. Our chatbot has over 300 weekly users. Expert review shows that 53% of its responses are correct or close to correct on an internal benchmark set. Through ablation studies, we identify the most important knowledge graph and modeling components, offering a practical path for developing enterprise Text-to-SQL solutions.
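The agent's "write a query, then automatically correct hallucinations and syntax errors" step can be pictured as a validate-and-retry loop. The sketch below is illustrative only and is not the system described in the paper: it assumes SQLite as a stand-in engine, uses `EXPLAIN` to compile a candidate query without running it, and replaces the language model with a hypothetical `stub_generate` function. The names `validate_sql`, `generate_with_correction`, and the `page_views` table are likewise invented for the example.

```python
import sqlite3
from typing import Callable, Optional

# Hypothetical generator signature: (question, error feedback or None) -> SQL text.
Generator = Callable[[str, Optional[str]], str]


def validate_sql(conn: sqlite3.Connection, sql: str) -> Optional[str]:
    """Return None if the query compiles against the schema, else the error text."""
    try:
        # EXPLAIN makes SQLite compile the statement without executing the query,
        # which surfaces syntax errors and unknown tables or columns.
        conn.execute(f"EXPLAIN {sql}")
        return None
    except sqlite3.Error as exc:
        return str(exc)


def generate_with_correction(
    question: str,
    generate: Generator,
    conn: sqlite3.Connection,
    max_attempts: int = 3,
) -> str:
    """Ask the generator for SQL, feeding engine errors back until it validates."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        sql = generate(question, feedback)
        feedback = validate_sql(conn, sql)
        if feedback is None:
            return sql
    raise ValueError(f"no valid query after {max_attempts} attempts: {feedback}")


def stub_generate(question: str, feedback: Optional[str]) -> str:
    """Stand-in for a model call: hallucinates a column first, then fixes it."""
    if feedback is None:
        return "SELECT member_name FROM page_views"  # column does not exist
    return "SELECT COUNT(*) FROM page_views"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (member_id INTEGER, viewed_at TEXT)")
result = generate_with_correction("How many page views?", stub_generate, conn)
print(result)
```

Compiling the query against the engine catches only errors the engine itself can detect, such as unknown columns or malformed syntax. A query that compiles but answers the wrong question would pass this check, which is consistent with the abstract's reliance on expert review to judge correctness.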