Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth

📅 2024-12-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Per-key frequency estimation in high-speed data streams is limited by low accuracy, slow update speed, and reliance on ground-truth labels under strict memory constraints. To address these limitations, this paper proposes UCL-sketch, an unsupervised online learning sketch. Methodologically, it introduces (1) a label-free online training mechanism that updates parameters in real time via equivalent learning, and (2) a hierarchical architecture of logical estimation buckets that balances fine-grained accuracy against computational scalability within bounded memory. Its equation-based sketch framework supports efficient incremental inference. Experiments on both real-world and synthetic datasets show that UCL-sketch significantly outperforms state-of-the-art methods, including Count-Min and DeepSketch: it reduces single-key estimation error by 40%–65% and fits the frequency distribution more closely, while keeping memory overhead comparable.

📝 Abstract
Estimating the frequency of items in high-volume, fast data streams has been studied extensively in many areas, such as databases and network measurement. Traditional sketch algorithms give only very rough estimates under a limited memory budget. Some learning-augmented algorithms have been proposed recently, but their offline framework requires actual frequencies for training, which are generally hard to obtain, and they are too slow for real-time processing while their accuracy remains coarse-grained. To this end, we propose a more practical learning-based estimation framework, UCL-sketch, which follows the line of equation-based sketches to estimate per-key frequencies. In a nutshell, it relies on two key techniques: online training via equivalent learning without ground truth, and a highly scalable architecture with logical estimation buckets. We conducted experiments on both real-world and synthetic datasets. The results demonstrate that our method greatly outperforms existing state-of-the-art sketches in per-key accuracy and distribution, while preserving resource efficiency. Our code is attached in the supplementary material and will be made publicly available at https://github.com/Y-debug-sys/UCL-sketch.
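To make the "traditional sketch" baseline concrete, here is a minimal Count-Min sketch in Python, the classic method that the AI summary names as a comparison point. It is an illustrative sketch only and is not taken from the UCL-sketch repository: the class name `CountMinSketch`, the salted BLAKE2b hashing, and the `width` and `depth` parameters are assumptions chosen for this example.

```python
import hashlib


class CountMinSketch:
    """Minimal Count-Min sketch: `depth` rows of `width` counters (illustrative)."""

    def __init__(self, width: int, depth: int):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key: str, row: int) -> int:
        # Salting the key with the row number gives one hash function per row
        # (an assumption of this example, not a prescribed hashing scheme).
        digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, key: str, count: int = 1) -> None:
        # Each arrival increments exactly one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def query(self, key: str) -> int:
        # Collisions only add to counters, so with non-negative updates
        # the minimum over rows never underestimates the true count.
        return min(self.table[row][self._index(key, row)]
                   for row in range(self.depth))


# Feed a small, skewed stream through a deliberately tiny sketch.
cms = CountMinSketch(width=8, depth=3)
stream = ["a"] * 50 + ["b"] * 7 + ["c"]
for key in stream:
    cms.update(key)
```

With non-negative updates, `cms.query(key)` is always at least the true count, and hash collisions can push it higher. That one-sided, collision-driven error is the kind of rough estimate the abstract attributes to traditional sketches.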
Problem

Research questions and friction points this paper is trying to address.

Estimating item frequencies in high-volume data streams without ground truth
Overcoming slow update speeds in learning-augmented frequency estimation methods
Achieving high accuracy under strict memory constraints for real-time processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online training without ground truth
Scalable structured estimation buckets
Compressive sensing for error bound
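
One common reading of an equation-based sketch treats the counters as linear measurements y = A x of the unknown frequency vector x, so per-key estimation becomes solving an underdetermined linear system, which is where compressive sensing arguments enter. The NumPy sketch below builds such a system under that assumption and recovers one solution with ordinary least squares. The least-squares step is only a stand-in: UCL-sketch learns its solver online, and nothing here reproduces the paper's architecture or its logical estimation buckets. The function names, the random bucket assignment, and all sizes are hypothetical.

```python
import numpy as np


def build_sketch_matrix(num_keys, width, depth, seed=0):
    """Return a (depth*width, num_keys) 0/1 matrix; A[i, j] = 1 if key j hits counter i."""
    rng = np.random.default_rng(seed)
    A = np.zeros((depth * width, num_keys))
    keys = np.arange(num_keys)
    for row in range(depth):
        # A random bucket choice stands in for the row's hash function (assumption).
        buckets = rng.integers(0, width, size=num_keys)
        A[row * width + buckets, keys] = 1.0
    return A


def count_min_from_counters(A, y):
    """Count-Min estimate: minimum over the counters each key touches."""
    return np.array([y[A[:, j] == 1.0].min() for j in range(A.shape[1])])


def least_squares_estimate(A, y):
    """Minimum-norm least-squares solution of A x = y (stand-in for a learned solver)."""
    x_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    return x_hat


# Skewed synthetic frequencies for 12 keys, observed through 3 x 4 = 12 counters.
x_true = np.array([100, 40, 10, 5, 3, 2, 1, 1, 1, 0, 0, 0], dtype=float)
A = build_sketch_matrix(num_keys=x_true.size, width=4, depth=3)
y = A @ x_true                        # the counter values the sketch would store
x_cm = count_min_from_counters(A, y)  # never below x_true for non-negative streams
x_ls = least_squares_estimate(A, y)   # consistent with every counter equation
```

The system is consistent but has many solutions, so `A @ x_ls` reproduces the counters without `x_ls` necessarily matching `x_true`. Picking a good solution among them is the part that prior structure or learning has to supply.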
Xinyu Yuan
Mila - Quebec AI Institute
Cell/Protein/Graph Representation Learning, Knowledge Graph Reasoning
Yan Qiao
Macau University of Science and Technology
Semicond. smart manuf., scheduling and control, AI and its applications
Meng Li
School of Computer Science and Information Engineering, Hefei University of Technology, China
Zhenchun Wei
School of Computer Science and Information Engineering, Hefei University of Technology, China
Cuiying Feng
School of Information and Software Engineering, University of Electronic Science and Technology of China, China