🤖 AI Summary
This paper studies the dynamic maximum coverage problem under the turnstile streaming model: given a data stream of insertions and deletions, dynamically select $k$ subsets from $d$ candidates to maximize the size of their union. It further extends this framework to re-identification risk analysis by identifying high-risk fingerprint features. The authors propose the first streaming algorithm for maximum coverage with polylogarithmic-in-$n$ update time, integrating frequency moment estimation (for $p \geq 2$), sketch-based compression, streaming hash functions, and hierarchical sampling. They also develop two risk identification frameworks—targeted (leveraging known target fingerprints) and generic (requiring no prior knowledge)—both achieving theoretically optimal approximation ratios. Empirically, their method accelerates fingerprint identification by up to $210\times$ over prior approaches, while ensuring rigorous theoretical guarantees and practical deployability.
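To make the frequency moment ingredient concrete: the $p^{\text{th}}$ frequency moment of a vector $x$ is $F_p(x) = \sum_i |x_i|^p$. The paper's contribution is estimating a related quantity (the complement of $F_p$) over a turnstile stream; the exact offline computation below is only a minimal illustration of the definition, not the paper's estimator.

```python
def frequency_moment(x, p):
    """Exact p-th frequency moment F_p(x) = sum of |x_i|^p.

    Illustrative only: the paper works with sketch-based *estimates*
    of a related quantity under streaming updates, not this exact sum.
    """
    return sum(abs(v) ** p for v in x)

# F_2 of (3, -1, 2) is 9 + 1 + 4 = 14
f2 = frequency_moment([3, -1, 2], 2)
```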
📝 Abstract
In the maximum coverage problem, we are given $d$ subsets of a universe $[n]$, and the goal is to output $k$ subsets whose union covers the largest possible number of distinct items. We present the first algorithm for maximum coverage in the turnstile streaming model, where updates, each inserting or deleting an item from a subset, arrive one by one. Notably, our algorithm uses only $\mathrm{polylog}(n)$ update time. We also present turnstile streaming algorithms for targeted and general fingerprinting for risk management, where the goal is to determine which features pose the greatest re-identification risk in a dataset. As part of our work, we give a result of independent interest: an algorithm to estimate the complement of the $p^{\text{th}}$ frequency moment of a vector for $p \geq 2$. Empirical evaluation confirms the practicality of our fingerprinting algorithms, demonstrating a speedup of up to $210\times$ over prior work.
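For intuition about the problem being solved, here is the classical offline greedy baseline for maximum coverage, which repeatedly picks the subset covering the most not-yet-covered items and achieves the standard $(1 - 1/e)$ approximation. This is a sketch of the textbook algorithm only; the paper's contribution is a turnstile streaming algorithm that cannot afford to store the subsets explicitly like this.

```python
def greedy_max_coverage(subsets, k):
    """Offline greedy baseline: pick up to k subsets maximizing union size.

    subsets: list of Python sets over the universe.
    Returns (chosen_indices, covered_items).
    """
    covered = set()
    chosen = []
    remaining = list(range(len(subsets)))
    for _ in range(k):
        # Pick the subset with the largest marginal coverage gain.
        best = max(remaining, key=lambda i: len(subsets[i] - covered), default=None)
        if best is None or not (subsets[best] - covered):
            break  # no subset adds new items
        chosen.append(best)
        covered |= subsets[best]
        remaining.remove(best)
    return chosen, covered

subsets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]
chosen, covered = greedy_max_coverage(subsets, 2)
```

With the example input, greedy first takes $\{1,2,3\}$ and then $\{4,5,6\}$, covering all six items with $k = 2$.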