CoddLLM: Empowering Large Language Models for Data Analytics

📅 2025-02-01
🤖 AI Summary
To address challenges in natural language understanding and structured query generation for data analytics, this paper introduces CoddLLM, a domain-specific large language model for database and data lake scenarios. Methodologically, it proposes a scalable synthetic data generation method covering data representation and manipulation topics, introduces two new tasks that bridge tables and text, and post-trains the Mistral-NeMo-12B base model with the resulting data recipe. Contributions include: (1) AnalyticsMMLU, a benchmark of thousands of multiple-choice questions on databases, data analysis, and machine learning, together with three benchmarks for data discovery; (2) the highest average accuracy across eight datasets; and (3) outperforming GPT-3.5-Turbo on AnalyticsMMLU, exceeding GPT-4o by 12.1% in table selection, and improving Text-to-SQL by an average of 24.9% over the base model.

📝 Abstract
Large Language Models (LLMs) have the potential to revolutionize data analytics by simplifying tasks such as data discovery and SQL query synthesis through natural language interactions. This work serves as a first step toward the development of foundation models explicitly designed for data analytics applications. To propel this vision forward, we unveil a new data recipe for post-training LLMs, enhancing their comprehension of data management and empowering them to tackle complex real-world analytics tasks. Specifically, our approach includes a scalable synthetic data generation method that enables the creation of a broad spectrum of topics centered on data representation and manipulation. Furthermore, we introduce two new tasks that bridge tables and text. We show that such tasks can enhance models' understanding of schema creation and the nuanced translation between natural language and tabular data. Leveraging this data recipe, we post-train a new foundation model, named CoddLLM, based on Mistral-NeMo-12B. To assess the language understanding and reasoning capabilities of LLMs in the realm of data analytics, we contribute AnalyticsMMLU, a benchmark containing thousands of multiple-choice questions on databases, data analysis, and machine learning. For data discovery, we contribute three comprehensive benchmarks that address both database and data lake scenarios. CoddLLM achieves the highest average accuracy across eight datasets. It outperforms GPT-3.5-Turbo on AnalyticsMMLU, exceeds GPT-4o by 12.1% in table selection, and shows an average improvement of 24.9% in Text-to-SQL compared to the base model.
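
As a rough illustration of how an "average accuracy across datasets" figure for multiple-choice benchmarks can be computed, the sketch below scores predicted option letters against gold answers and takes an unweighted mean over datasets. The abstract does not state whether the reported average is unweighted or weighted by dataset size, so the unweighted choice is an assumption; the function names and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import mean


def accuracy(predictions, answers):
    """Fraction of multiple-choice questions answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty dataset")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def average_accuracy(per_dataset):
    """Unweighted mean of per-dataset accuracies (an assumed convention)."""
    return mean(accuracy(preds, gold) for preds, gold in per_dataset.values())


# Hypothetical toy data: dataset name -> (predicted options, gold options).
toy = {
    "dataset_a": (["A", "C", "B", "D"], ["A", "C", "D", "D"]),  # 3 of 4 correct
    "dataset_b": (["B", "B"], ["B", "A"]),                      # 1 of 2 correct
}

scores = {name: accuracy(preds, gold) for name, (preds, gold) in toy.items()}
macro = average_accuracy(toy)
print(scores, macro)
```

With an unweighted mean every dataset counts equally regardless of its size; a size-weighted mean would instead pool all questions, which can give a different number when datasets differ in length.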
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Data Parsing
SQL Query Transformation
Innovation

Methods, ideas, or system contributions that make the work stand out.

CoddLLM
data analysis
text-to-SQL