A smoothed-Bayesian approach to frequency recovery from sketched data

📅 2023-09-27

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This paper addresses the problem of recovering the empirical frequency of a symbol from a compressed sketch produced by random hashing, with a focus on large-scale discrete data whose frequencies follow power-law or other heavy-tailed distributions. To overcome the computational limitations of existing Bayesian nonparametric (BNP) approaches, the authors propose a smoothed-Bayesian estimator that borrows the modeling intuition of BNP methods but is designed and analyzed in a frequentist framework. For sketches built with a single hash function, the estimator is unbiased and optimal under squared error loss within a class of linear estimators. For sketches built with multiple hash functions, a multi-view learning approach yields computationally efficient estimators. The method is validated on synthetic and real data and compared against existing alternatives.

📝 Abstract

We provide a novel statistical perspective on a classical problem at the intersection of computer science and information theory: recovering the empirical frequency of a symbol in a large discrete dataset using only a compressed representation, or sketch, obtained via random hashing. Departing from traditional algorithmic approaches, recent works have proposed Bayesian nonparametric (BNP) methods that can provide more informative frequency estimates by leveraging modeling assumptions about the distribution of the sketched data. In this paper, we propose a smoothed-Bayesian method, inspired by existing BNP approaches but designed in a frequentist framework to overcome the computational limitations of the BNP approaches when dealing with large-scale data from realistic distributions, including those with power-law tail behaviors. For sketches obtained with a single hash function, our approach is supported by rigorous frequentist properties, including unbiasedness and optimality under a squared error loss function within an intuitive class of linear estimators. For sketches with multiple hash functions, we introduce an approach based on multi-view learning to construct computationally efficient frequency estimators. We validate our method on synthetic and real data, comparing its performance to that of existing alternatives.
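
To make the single-hash setting concrete, the following Python sketch builds a sketch with one hash function and applies a simple debiased linear correction to the queried bucket count. It is an illustration under stated assumptions, not the estimator proposed in the paper: the names `hash_bucket`, `build_sketch`, and `debiased_estimate` are hypothetical, SHA-256 with a seed prefix stands in for an ideal uniform random hash, and the correction (J*C - n)/(J - 1) is the textbook unbiased estimator that follows from E[C] = f + (n - f)/J when every other symbol lands in the queried bucket with probability 1/J.

```python
import hashlib


def hash_bucket(symbol, num_buckets, seed=0):
    """Map a symbol to a bucket index (assumption: SHA-256 mimics a uniform random hash)."""
    digest = hashlib.sha256(f"{seed}:{symbol}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def build_sketch(data, num_buckets, seed=0):
    """Compress the data into num_buckets counters using a single hash function."""
    sketch = [0] * num_buckets
    for symbol in data:
        sketch[hash_bucket(symbol, num_buckets, seed)] += 1
    return sketch


def debiased_estimate(sketch, symbol, n, seed=0):
    """Linear estimate of a symbol's frequency from its bucket count.

    The raw bucket count C over-counts because of hash collisions. Under
    uniform hashing, E[C] = f + (n - f) / J, so (J * C - n) / (J - 1) is
    unbiased for the true frequency f. Requires J > 1.
    """
    num_buckets = len(sketch)
    count = sketch[hash_bucket(symbol, num_buckets, seed)]
    return (num_buckets * count - n) / (num_buckets - 1)


# Small usage example on toy data with a heavy head and a long tail.
data = ["the"] * 50 + ["of"] * 20 + [f"rare{i}" for i in range(30)]
sketch = build_sketch(data, num_buckets=16)
print("raw bucket count:", sketch[hash_bucket("the", 16)])
print("debiased estimate:", debiased_estimate(sketch, "the", n=len(data)))
```

The raw bucket count can never fall below the true frequency, whereas the debiased estimate can be negative or fractional for a given hash draw; it is correct only on average over the random hash.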

Problem

Research questions and friction points this paper is trying to address.

Recover symbol frequencies from compressed sketches using Bayesian methods

Overcome computational limits of Bayesian approaches for large-scale data

Develop efficient estimators for sketches with multiple hash functions
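
For context on the multiple-hash setting, a well-known example of the traditional algorithmic approach is the count-min sketch, which keeps one row of counters per hash function and answers a query with the minimum of the matching counters. The Python sketch below is a minimal illustration of that classical baseline, not the multi-view estimator introduced in the paper; the class name `CountMinSketch`, the helper `hash_bucket`, and the use of seeded SHA-256 as a stand-in for independent random hash functions are all assumptions made for the example.

```python
import hashlib


def hash_bucket(symbol, num_buckets, seed):
    """Map a symbol to a bucket index; a different seed acts as a different hash function."""
    digest = hashlib.sha256(f"{seed}:{symbol}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


class CountMinSketch:
    """Sketch with num_hashes rows of num_buckets counters each."""

    def __init__(self, num_hashes, num_buckets):
        self.num_hashes = num_hashes
        self.num_buckets = num_buckets
        self.table = [[0] * num_buckets for _ in range(num_hashes)]

    def add(self, symbol, count=1):
        """Increment one counter per row, chosen by that row's hash function."""
        for row in range(self.num_hashes):
            self.table[row][hash_bucket(symbol, self.num_buckets, row)] += count

    def query_min(self, symbol):
        """Classical estimate: the smallest matching counter across rows.

        Every matching counter includes the symbol's own occurrences plus any
        collisions, so the minimum is an upper bound on the true frequency.
        """
        return min(
            self.table[row][hash_bucket(symbol, self.num_buckets, row)]
            for row in range(self.num_hashes)
        )


# Small usage example.
cms = CountMinSketch(num_hashes=3, num_buckets=8)
for symbol in ["a"] * 5 + ["b"] * 3 + ["c"]:
    cms.add(symbol)
print("estimate for 'a':", cms.query_min("a"))
```

The minimum rule never underestimates, but it is biased upward when many symbols collide; estimators that use modeling assumptions about the data aim to produce more informative answers from the same table.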

Innovation

Methods, ideas, or system contributions that make the work stand out.

Smoothed-Bayesian method for frequency recovery

Multi-view learning with multiple hash functions

Frequentist framework for large-scale data

🔎 Similar Papers

Data-Efficient Sleep Staging with Synthetic Time Series Pretraining

2024-03-13, arXiv.org, Citations: 0

Comparison Performance of Spectrogram and Scalogram as Input of Acoustic Recognition Task

2024-03-06, FICC, Citations: 14
