Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth

📅 2024-12-04

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the limitations of single-key frequency estimation in high-speed data streams—namely, low accuracy, slow update speed, and reliance on ground-truth labels under strict memory constraints—this paper proposes UCL-sketch, an unsupervised online learning sketch. Methodologically, it introduces (1) a novel label-free online training mechanism that enables real-time, self-adaptive parameter updates via equivalent learning, and (2) a hierarchical logical estimation bucket architecture that balances fine-grained accuracy and computational scalability within bounded memory. Its equation-driven sketch framework supports efficient incremental inference. Experiments on both real-world and synthetic datasets demonstrate that UCL-sketch significantly outperforms state-of-the-art methods—including Count-Min and DeepSketch—reducing single-key estimation error by 40%–65%, achieving superior frequency distribution fitting, while maintaining comparable memory overhead.

📝 Abstract

Estimating the frequency of items in high-volume, fast data streams has been extensively studied in many areas, such as databases and network measurement. Traditional sketch algorithms can give only very rough estimates under a limited memory budget. Some learning-augmented algorithms have been proposed recently, but their offline framework requires actual frequencies for training, which are generally hard to obtain, and they are too slow for real-time processing while their accuracy remains coarse-grained. To this end, we propose a more practical learning-based estimation framework, UCL-sketch, which follows the line of equation-based sketches to estimate per-key frequencies. In a nutshell, it rests on two key techniques: online training via equivalent learning without ground truth, and a highly scalable architecture with logical estimation buckets. We conducted experiments on both real-world and synthetic datasets. The results demonstrate that our method greatly outperforms existing state-of-the-art sketches in per-key accuracy and distribution, while preserving resource efficiency. Our code is attached in the supplementary material and will be made publicly available at https://github.com/Y-debug-sys/UCL-sketch.
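
For context on the "traditional sketch" baseline the abstract contrasts against, here is a minimal Count-Min sketch in Python. It is an illustrative sketch only, not code from the paper or its repository: the class name, the default `width` and `depth`, and the use of salted BLAKE2b as the hash family are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch (hypothetical helper, not the paper's code)."""

    def __init__(self, width=1024, depth=4):
        self.width = width   # counters per row (assumed default)
        self.depth = depth   # number of hash rows (assumed default)
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # One hash function per row, derived by salting BLAKE2b with the row id.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, key, count=1):
        # Each arriving item increments one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key):
        # Collisions only ever add to a counter, so the minimum over rows
        # is an upper bound on the true frequency.
        return min(
            self.table[row][self._index(key, row)] for row in range(self.depth)
        )


stream = ["a"] * 5 + ["b"] * 3 + ["c"]
sketch = CountMinSketch()
for item in stream:
    sketch.update(item)
```

Because hash collisions can only inflate counters, `sketch.estimate(key)` never underestimates the true count; the overestimation grows as memory shrinks, which is the coarse accuracy the abstract refers to.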

Problem

Research questions and friction points this paper is trying to address.

Estimating item frequencies in high-volume data streams without ground truth

Overcoming slow update speeds in learning-augmented frequency estimation methods

Achieving high accuracy under strict memory constraints for real-time processing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Online training without ground truth

Scalable structured estimation buckets

Compressive sensing for error bound
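
As a rough guide to how an equation-based sketch connects to compressive sensing, the generic formulation below may help. It is a textbook-style sketch under stated assumptions, not the paper's exact model: the symbols $x$ (unknown per-key frequencies over $n$ keys), $y$ (the $m$ observed counters, with $m \ll n$), and $A$ (a 0/1 matrix recording which key hashes to which counter) are introduced here for illustration only.

```latex
% Each counter is the sum of the frequencies of the keys hashed into it:
y = A x, \qquad A \in \{0,1\}^{m \times n}, \quad
A_{ij} = 1 \iff \text{key } j \text{ maps to counter } i .

% With m << n the system is underdetermined. Assuming x is approximately
% sparse (a few heavy keys dominate), a standard compressive-sensing
% recovery is basis pursuit with a non-negativity constraint:
\hat{x} = \arg\min_{x \ge 0} \; \lVert x \rVert_1
\quad \text{subject to} \quad A x = y .
```

Under this reading, per-key estimation amounts to solving for $x$ from the counters alone, which is why no ground-truth frequencies are needed to pose the problem.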

🔎 Similar Papers

A smoothed-Bayesian approach to frequency recovery from sketched data