🤖 AI Summary
This paper studies ℓₚ sampling and frequency moment (Fₚ) estimation over single-pass insertion-only data streams, and then addresses the nonlinear challenges introduced by *forget operations*, which reset the frequency of a single element to zero. The authors present a nearly space-optimal approximate ℓₚ sampler and extend it to a continuous ℓₚ sampler that outputs a valid sample at every point in the stream. Building on these samplers, they design nearly unbiased Fₚ estimators for the forget model and generalize that model to the suffix-prefix deletion model. They also show how to handle arbitrary coordinate-wise functions applied during the stream, including linear and nonlinear contraction functions. For p ∈ (0,2), the sampler uses Õ(log n · log(1/δ)) bits of space; for p = 2, it uses Õ(log² n · log(1/δ)) bits. The work resolves the three open problems posed by Pavan, Chakraborty, Vinodchandran, and Meel at PODS ’24 and yields an entropy estimation algorithm as a corollary.
📝 Abstract
We study $\ell_p$ sampling and frequency moment estimation in a single-pass insertion-only data stream. For $p \in (0,2)$, we present a nearly space-optimal approximate $\ell_p$ sampler that uses $\widetilde{O}(\log n \log(1/\delta))$ bits of space and for $p = 2$, we present a sampler with space complexity $\widetilde{O}(\log^2 n \log(1/\delta))$. This space complexity is optimal for $p \in (0, 2)$ and improves upon prior work by a $\log n$ factor. We further extend our construction to a continuous $\ell_p$ sampler, which outputs a valid sample index at every point during the stream.
Leveraging these samplers, we design nearly unbiased estimators for $F_p$ in data streams that include forget operations, which reset individual element frequencies and introduce significant non-linear challenges. As a result, we obtain near-optimal algorithms for estimating $F_p$ for all $p$ in this model, originally proposed by Pavan, Chakraborty, Vinodchandran, and Meel [PODS'24], resolving all three open problems they posed.
Furthermore, we generalize this model to what we call the suffix-prefix deletion model, and extend our techniques to estimate entropy as a corollary of our moment estimation algorithms. Finally, we show how to handle arbitrary coordinate-wise functions during the stream, for any $g \in \mathbb{G}$, where $\mathbb{G}$ includes all (linear or non-linear) contraction functions.
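To make the quantities in the abstract concrete, the following is a minimal reference sketch of the forget model. It is not the paper's algorithm: it stores every frequency exactly (linear space) and only illustrates what $F_p = \sum_i f_i^p$ and the ideal $\ell_p$ sampling distribution $f_i^p / F_p$ mean once forget operations are allowed. The update encoding (`"insert"`, `"forget"`) and all function names are assumptions made for illustration.

```python
import random
from collections import Counter


def apply_stream(updates):
    """Replay a stream exactly; hypothetical encoding: (op, item) pairs."""
    freq = Counter()
    for op, item in updates:
        if op == "insert":
            freq[item] += 1        # unit insertion
        elif op == "forget":
            freq.pop(item, None)   # forget: reset this item's frequency to 0
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return freq


def frequency_moment(freq, p):
    """Exact F_p: sum of f_i^p over items with positive frequency."""
    return sum(f ** p for f in freq.values() if f > 0)


def lp_distribution(freq, p):
    """Ideal l_p sampling distribution: item i has probability f_i^p / F_p."""
    total = frequency_moment(freq, p)
    if total == 0:
        return {}
    return {item: (f ** p) / total for item, f in freq.items() if f > 0}


def lp_sample(freq, p, rng=random):
    """Draw one item from the ideal l_p distribution (None if stream is empty)."""
    dist = lp_distribution(freq, p)
    if not dist:
        return None
    items = list(dist)
    weights = [dist[item] for item in items]
    return rng.choices(items, weights=weights, k=1)[0]


# Example stream: the forget wipes out the three earlier insertions of "a".
updates = (
    [("insert", "a")] * 3
    + [("insert", "b")] * 2
    + [("forget", "a"), ("insert", "a")]
)
freq = apply_stream(updates)
```

In this example the final frequencies are `a = 1` and `b = 2`, so $F_2 = 5$ and an ideal $\ell_2$ sampler returns `b` with probability $4/5$. The forget step is what makes the frequency vector a non-linear function of the updates, which is the difficulty the streaming algorithms in the paper are designed to handle in small space.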