Can Large Language Models be a Cardinality Estimator? An Empirical Study

📅 2026-03-30
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing cardinality estimation methods face limitations in generalization, support for complex queries, and data preparation overhead, hindering their applicability in real-world database systems. This work presents the first systematic exploration of large language models (LLMs) for cardinality estimation, leveraging prompt engineering, parameter fine-tuning, and a self-correction mechanism during inference to significantly improve estimation accuracy under both in-distribution and out-of-distribution scenarios. The proposed approach consistently outperforms state-of-the-art methods across nearly all experimental settings. Furthermore, end-to-end query execution experiments demonstrate that the performance gains effectively offset the inference overhead, highlighting the method's strong data efficiency and capability in handling complex queries.
πŸ“ Abstract
Cardinality estimation (CardEst) remains a challenging problem for DBMSs. Recent years have witnessed the success of ML-based cardinality estimators in outperforming traditional methods. However, these solutions suffer from poor generalizability to new data or query distributions, an inability to handle complex queries, and substantial data preparation overhead, thus preventing their wide adoption in real-world DBMSs. Some recent efforts have been dedicated to addressing some, but not all, of these issues. We notice that the recently emerging Large Language Models (LLMs) have shown remarkable generalizability to unseen tasks, the capability to understand complex programs, and the power to perform data-efficient fine-tuning. In light of this, we propose to leverage LLMs to mitigate the above issues. Specifically, we carefully craft prompts, and subsequently perform fine-tuning and self-correction during inference with LLMs for the CardEst task. We then extensively evaluate LLMs' in-distribution and out-of-distribution generalizability, their feasibility for supporting complex queries, and their training data efficiency when fine-tuning LLMs on pre-training datasets. The results suggest that LLMs outperform the state of the art in almost all settings, indicating their potential for the CardEst task. We further measure the end-to-end query execution time in a DBMS using the cardinalities estimated by LLMs in some practical settings, which suggests that the inference overhead of LLMs can be outweighed by the benefits they bring to CardEst.
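The abstract's pipeline of prompt crafting plus self-correction at inference can be illustrated with a minimal sketch. The prompt wording, the correction rules, and all names below (`build_prompt`, `self_correct`, the stubbed `fake_llm`) are hypothetical illustrations, not the paper's actual implementation:

```python
# Illustrative sketch of prompt-based CardEst with a self-correction loop.
# Assumption: the model is asked for a single integer row-count estimate,
# and malformed replies are fed back for another attempt.

def build_prompt(schema: str, query: str) -> str:
    """Compose a CardEst prompt from a table schema and a SQL query."""
    return (
        "You are a cardinality estimator.\n"
        f"Schema: {schema}\n"
        f"Query: {query}\n"
        "Output only the estimated number of result rows as an integer."
    )

def self_correct(llm, prompt: str, max_rounds: int = 3) -> int:
    """Re-query the model until it returns a parseable non-negative integer."""
    answer = llm(prompt)
    for _ in range(max_rounds):
        try:
            est = int(answer.strip())
            if est >= 0:
                return est
        except ValueError:
            pass
        # Feed the invalid output back so the model can repair it.
        answer = llm(
            prompt
            + f"\nPrevious invalid answer: {answer!r}. "
              "Reply with a single non-negative integer."
        )
    return 1  # neutral fallback estimate if correction keeps failing

# Stub model for demonstration: first reply is malformed, second is valid.
replies = iter(["about 1200 rows", "1200"])
fake_llm = lambda _prompt: next(replies)

prompt = build_prompt(
    "title(id, production_year)",
    "SELECT COUNT(*) FROM title WHERE production_year > 2000",
)
print(self_correct(fake_llm, prompt))  # 1200
```

In a real setup the stub would be replaced by a call to a fine-tuned LLM; the loop structure, where malformed outputs are echoed back for correction, is the part that mirrors the inference-time self-correction idea.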
Problem

Research questions and friction points this paper is trying to address.

Cardinality Estimation
Database Management Systems
Machine Learning
Query Optimization
Generalizability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Cardinality Estimation
Prompt Engineering
Fine-tuning
Out-of-Distribution Generalization
Liangzu Liu
Peking University
Yiyan Wang
Peking University
Yinjun Wu
Peking University
Runze Su
Peking University
Zhuo Chang
Peking University
Peizhi Wu
Bytedance, University of Pennsylvania
Jianjun Chen
ByteDance
Database
Fuxin Jiang
ByteDance
Time Series Forecasting, Resource Scheduling, LLM
Rui Shi
ByteDance, Inc.
Database Systems, Big Data, Distributed Systems, Cloud Native, Programming Languages
Bin Cui
Peking University
Tieying Zhang
Research Scientist at Bytedance
AI for Systems, Systems for AI