LEARNT: A Practical Estimator for Cardinality of LIKE Queries with Formal Accuracy Guarantees

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses cardinality estimation for string LIKE queries (prefix, suffix, and substring patterns) and presents the first solution with formal Q-error guarantees. The method formulates cardinality estimation as a bucket classification task and introduces a tunable, robust, low-overhead layered Bloom filter architecture with compact auxiliary tables that exploits query skew to reduce storage. A prefix-walk strategy helps identify queries with empty answers, and a Markov model extends the estimator to patterns of arbitrary length. Evaluated on four real-world datasets, the approach achieves 1.3–1.7× lower mean Q-error and significantly lower tail errors than state-of-the-art methods such as CLIQUE and LPLM, with up to 70× faster construction at comparable memory consumption.
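To make the link between bucket classification and a Q-error bound concrete, here is a minimal sketch. It assumes geometric buckets of the form [base^i, base^(i+1)) and a geometric-midpoint estimate per bucket; this page does not spell out LEARNT's actual bucket scheme, so the function names and the bucketing rule below are illustrative assumptions only. Under that assumption, a correctly classified non-empty query has Q-error at most sqrt(base).

```python
import math


def q_error(estimate: float, truth: float) -> float:
    """Q-error: max(estimate/truth, truth/estimate), for positive values."""
    if estimate <= 0 or truth <= 0:
        raise ValueError("Q-error is defined for positive cardinalities only")
    return max(estimate / truth, truth / estimate)


def bucket_index(cardinality: int, base: float) -> int:
    """Return i such that base**i <= cardinality < base**(i + 1).

    Hypothetical bucketing rule, assumed for illustration.
    """
    if cardinality < 1 or base <= 1:
        raise ValueError("need cardinality >= 1 and base > 1")
    i = int(math.floor(math.log(cardinality, base)))
    # Correct for floating-point rounding in the logarithm.
    while base ** i > cardinality:
        i -= 1
    while base ** (i + 1) <= cardinality:
        i += 1
    return i


def bucket_estimate(index: int, base: float) -> float:
    """Geometric midpoint of bucket [base**index, base**(index + 1))."""
    return base ** (index + 0.5)
```

With this choice, the accuracy guarantee is controlled by a single parameter: a smaller base gives a tighter bound but more buckets to distinguish.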

📝 Abstract

We study the problem of cardinality estimation for LIKE queries on string data, focusing on the most common patterns in real workloads: prefix, suffix, and substring queries. We propose LEARNT, a LIKE query Estimator with Accuracy, Robustness, Negligible overhead, Tunability, and Theoretical guarantees. LEARNT formulates estimation as a bucket-classification problem, and upon correct classification, it yields formal bounds on Q-error for queries with non-empty answers. It employs a memory-efficient bucketed layered-filter architecture with Bloom filters and compact auxiliary tables, together with optimizations that exploit query skew to reduce storage. For queries with empty answers, LEARNT incorporates dedicated filter-based and prefix-walk strategies, providing probabilistic guarantees on correct identification. Furthermore, to support arbitrarily long query strings, we extend LEARNT with a Markov modeling scheme that composes short-query statistics into estimates for longer queries. A theoretical framework guides parameter selection to minimize storage under accuracy and robustness constraints. Extensive experiments on four real-world datasets show that LEARNT consistently outperforms state-of-the-art methods such as CLIQUE and LPLM, achieving 1.3-1.7x lower mean Q-error, significantly lower tail errors, and up to 70x faster construction with comparable memory usage.
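The idea of composing short-query statistics into estimates for longer queries can be sketched with a textbook order-(k-1) Markov composition over substring counts. This is an assumption-laden illustration, not LEARNT's scheme: it stores exact counts in a dictionary where LEARNT would use its filters and buckets, and the names `build_counts` and `markov_estimate` are hypothetical.

```python
from collections import Counter


def build_counts(strings, k):
    """For each substring of length <= k, count how many strings contain it."""
    counts = Counter()
    for s in strings:
        seen = set()
        for n in range(1, k + 1):
            for i in range(len(s) - n + 1):
                seen.add(s[i:i + n])
        counts.update(seen)  # each string contributes at most 1 per substring
    return counts


def markov_estimate(query, counts, k):
    """Estimate how many strings contain `query` from counts of length <= k.

    Queries of length <= k are looked up directly. Longer queries start from
    the count of the first k characters and multiply by the ratio
    count(window of length k) / count(its first k-1 characters)
    for each subsequent window.
    """
    if k < 2:
        raise ValueError("k must be at least 2 for overlapping windows")
    if len(query) <= k:
        return float(counts.get(query, 0))
    estimate = float(counts.get(query[:k], 0))
    for i in range(1, len(query) - k + 1):
        numerator = counts.get(query[i:i + k], 0)
        denominator = counts.get(query[i:i + k - 1], 0)
        if numerator == 0 or denominator == 0:
            return 0.0  # some window never occurs, so the answer is empty
        estimate *= numerator / denominator
    return estimate
```

The composition rests on a conditional-independence assumption between distant characters, so the result for long queries is an approximation rather than an exact count.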

Problem

Research questions and friction points this paper is trying to address.

cardinality estimation

LIKE queries

string data

prefix queries

substring queries

Innovation

Methods, ideas, or system contributions that make the work stand out.

cardinality estimation

LIKE queries

Bloom filters