🤖 AI Summary
This paper addresses cardinality estimation for multi-join queries with selection predicates and group-by clauses. We propose a provably correct, pessimistic, and tight upper-bound estimator. Methodologically, we formulate the problem as a linear program whose constraints combine ℓₚ-norm statistics of the input relations with Shannon's information inequalities; we further introduce two optimizations that exploit the query structure to reduce estimation time. The estimator relies solely on lightweight degree-sequence statistics, requiring neither sampling nor materialized views. Evaluated on the JOB, STATS, and subgraph-matching benchmarks, it is 1–3 orders of magnitude more accurate than conventional database optimizers, while its estimation time and space requirements remain comparably low. When its estimates are injected into PostgreSQL, the resulting query plans are at least as good as those derived from the true cardinalities.
📝 Abstract
Cardinality estimation is the problem of estimating the size of the output of a query, without actually evaluating the query. The cardinality estimator is a critical piece of a query optimizer, and is often the main culprit when the optimizer chooses a poor plan. This paper introduces LpBound, a pessimistic cardinality estimator for multijoin queries (acyclic or cyclic) with selection predicates and group-by clauses. LpBound computes a guaranteed upper bound on the size of the query output using simple statistics on the input relations, consisting of $\ell_p$-norms of degree sequences. The bound is the optimal solution of a linear program whose constraints encode data statistics and Shannon inequalities. We introduce two optimizations that exploit the structure of the query in order to speed up the estimation time and make LpBound practical. We experimentally evaluate LpBound against a range of traditional, pessimistic, and machine learning-based estimators on the JOB, STATS, and subgraph matching benchmarks. Our main finding is that LpBound can be orders of magnitude more accurate than traditional estimators used in mainstream open-source and commercial database systems, while its estimation time and space requirements remain comparably low. When LpBound's estimates are injected into Postgres, it derives query plans at least as good as those derived using the true cardinalities.
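To make the notion of "$\ell_p$-norms of degree sequences" concrete, the following minimal sketch computes the degree sequence of a join column and uses its norms to upper-bound the size of a single two-way join. It is not LpBound's linear program: it only illustrates the simplest special cases of such bounds, which follow from Hölder's inequality, namely |R ⋈ S| ≤ ‖deg_R‖₁·‖deg_S‖∞, |R ⋈ S| ≤ ‖deg_R‖₂·‖deg_S‖₂, and |R ⋈ S| ≤ ‖deg_R‖∞·‖deg_S‖₁. The relations `R` and `S` and the helper names (`degree_sequence`, `lp_norm`, `join_size`) are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def degree_sequence(rows, key_index):
    """Frequencies of each distinct join-key value, sorted in descending order."""
    counts = Counter(row[key_index] for row in rows)
    return sorted(counts.values(), reverse=True)


def lp_norm(degrees, p):
    """l_p-norm of a degree sequence; p may be math.inf for the maximum degree."""
    if p == math.inf:
        return float(max(degrees, default=0))
    return sum(d ** p for d in degrees) ** (1.0 / p)


def join_size(r_rows, r_key, s_rows, s_key):
    """Exact size of the equi-join, computed from per-value frequencies."""
    r_counts = Counter(row[r_key] for row in r_rows)
    s_counts = Counter(row[s_key] for row in s_rows)
    return sum(c * s_counts[v] for v, c in r_counts.items())


# Hypothetical toy relations; the join key is the first column of each.
R = [(1, "a"), (1, "b"), (2, "c"), (3, "d")]
S = [(1, "x"), (1, "y"), (1, "z"), (2, "w")]

deg_r = degree_sequence(R, 0)  # frequencies of keys 1, 2, 3 in R
deg_s = degree_sequence(S, 0)  # frequencies of keys 1, 2 in S

true_size = join_size(R, 0, S, 0)

# Three upper bounds obtained from different pairs of norms.
bound_l1_linf = lp_norm(deg_r, 1) * lp_norm(deg_s, math.inf)
bound_l2_l2 = lp_norm(deg_r, 2) * lp_norm(deg_s, 2)
bound_linf_l1 = lp_norm(deg_r, math.inf) * lp_norm(deg_s, 1)

# Every bound is valid, so the smallest one is the best pessimistic estimate.
best_bound = min(bound_l1_linf, bound_l2_l2, bound_linf_l1)

print(f"true join size: {true_size}")
print(f"l1 * linf bound: {bound_l1_linf:.3f}")
print(f"l2 * l2 bound:   {bound_l2_l2:.3f}")
print(f"linf * l1 bound: {bound_linf_l1:.3f}")
print(f"best bound:      {best_bound:.3f}")
```

In this toy instance the ℓ₂-based bound is the smallest of the three, which hints at why keeping several norms per column can pay off: different norms give the best bound on different data distributions. For queries with more than two relations, the abstract describes combining such statistics with Shannon inequalities in a linear program instead of picking a single inequality by hand.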