Bootstrapping Learned Cost Models with Synthetic SQL Queries

📅 2025-08-27
🤖 AI Summary
Problem: The scarcity of realistic SQL query workloads hinders effective training of learned cost models. Method: The paper describes a synthetic data generation method, inspired by generative AI and large language models, for efficiently constructing high-fidelity, diverse datasets of SQL queries labeled with their execution costs and tailored to a specific database instance. The approach combines semantic-aware query sampling, execution cost annotation, and distribution optimization, reducing reliance on real-world data while preserving training efficacy. Results: Initial experiments report better cost prediction accuracy (23.6% lower average relative error) on mainstream database engines while using only 55% of the queries required by competing generation approaches, that is, 45% fewer. The method improves training efficiency and model generalization, supporting database performance optimization and stress testing.
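The two headline numbers can be made concrete with a small sketch. The exact error definition used in the paper is not given on this page, so the code below assumes the common form |predicted - actual| / actual averaged over queries; the function names `mean_relative_error` and `fraction_fewer` are hypothetical and chosen only for illustration.

```python
def mean_relative_error(predicted, actual):
    """Average of |predicted - actual| / actual over all queries.

    Assumed definition: the paper's exact metric is not stated in this summary.
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a <= 0 for a in actual):
        raise ValueError("actual costs must be positive")
    total = sum(abs(p - a) / a for p, a in zip(predicted, actual))
    return total / len(actual)


def fraction_fewer(fraction_used):
    """Convert 'uses this fraction of the queries' into 'this fraction fewer'."""
    return 1.0 - fraction_used


if __name__ == "__main__":
    # Two toy predictions, each off by 10% of the true cost.
    print(mean_relative_error([110.0, 90.0], [100.0, 100.0]))
    # Using 55% of the queries is the same statement as using 45% fewer.
    print(round(fraction_fewer(0.55), 2))
```

Under this reading, "only 55% of the queries" in the summary and "45% fewer queries" in the abstract describe the same result.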

📝 Abstract
Having access to realistic workloads for a given database instance is extremely important to enable stress and vulnerability testing, as well as to optimize for cost and performance. Recent advances in learned cost models have shown that when enough diverse SQL queries are available, one can effectively and efficiently predict the cost of running a given query against a specific database engine. In this paper, we describe our experience in exploiting modern synthetic data generation techniques, inspired by the generative AI and LLM community, to create high-quality datasets enabling the effective training of such learned cost models. Initial results show that we can improve a learned cost model's predictive accuracy by training it with 45% fewer queries than when using competitive generation approaches.
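To illustrate what a dataset of SQL queries labeled with costs looks like, here is a minimal sketch. It is not the paper's generator: it swaps in simple template filling for the generative techniques the abstract mentions, uses an in-memory SQLite database as a stand-in engine, and uses wall-clock time as the cost label. The `orders` table, the templates, and all function names are hypothetical.

```python
import random
import sqlite3
import time

# Hypothetical query templates; a real generator would be far more diverse.
TEMPLATES = [
    "SELECT COUNT(*) FROM orders WHERE amount > {amount}",
    "SELECT customer_id, SUM(amount) FROM orders "
    "WHERE customer_id < {cust} GROUP BY customer_id",
    "SELECT * FROM orders WHERE customer_id = {cust} ORDER BY amount DESC",
]


def build_demo_db(n_rows=1000, seed=0):
    """Create an in-memory SQLite database with one illustrative table."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
        "customer_id INTEGER, amount REAL)"
    )
    rows = [(i, rng.randrange(100), rng.uniform(1.0, 500.0)) for i in range(n_rows)]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def sample_query(rng):
    """Fill a randomly chosen template with random literal values."""
    template = rng.choice(TEMPLATES)
    return template.format(
        amount=round(rng.uniform(1.0, 500.0), 2),
        cust=rng.randrange(100),
    )


def label_query(conn, sql):
    """Run the query and record elapsed time as a (noisy) cost label."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    elapsed = time.perf_counter() - start
    return {"sql": sql, "seconds": elapsed, "rows": len(rows)}


def generate_dataset(conn, n_queries=20, seed=1):
    """Produce a list of query records, each paired with its measured cost."""
    rng = random.Random(seed)
    return [label_query(conn, sample_query(rng)) for _ in range(n_queries)]


if __name__ == "__main__":
    conn = build_demo_db()
    for record in generate_dataset(conn, n_queries=5):
        print(f"{record['seconds']:.6f}s  {record['rows']:>4} rows  {record['sql']}")
```

Records of this shape are what a learned cost model would be trained on. The abstract's claim concerns how many such records are needed to reach a given accuracy, which depends on how the queries are generated.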
Problem

Research questions and friction points this paper is trying to address.

Generating synthetic SQL queries for realistic database workloads
Improving learned cost models' predictive accuracy with fewer queries
Leveraging generative AI techniques for training data creation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using synthetic SQL queries from generative AI
Training learned cost models with fewer queries
Improving predictive accuracy with synthetic data
Michael Nidd
IBM Research
Computer networks, Computer security, Machine Learning, Cloud Operations
Christoph Miksovic
IBM Research Europe
Thomas Gschwind
IBM Research Europe
Francesco Fusco
IBM Research Europe
Andrea Giovannini
IBM Research Europe
Ioana Giurgiu
IBM Zurich
Cloud computing, Big data, Mobile devices