On Fine-Grained Distinct Element Estimation

📅 2025-06-27
🤖 AI Summary
This paper studies (1+ε)-approximate distinct element estimation in a distributed multi-server setting, with the goal of minimizing total communication. Recognizing the gap between pessimistic worst-case bounds and the much lower communication observed in practice, we introduce the number of pairwise collisions *C* = β/ε², i.e., instances where the same element appears on multiple servers, as a refined complexity parameter and establish matching communication upper and lower bounds parameterized by *C*. Our approach combines randomized hashing, frequency moment estimation, and information-theoretic lower bound analysis to design a low-communication protocol. When *C* is small, the communication cost improves to *O*(α log *n* + (√β/ε²) log *n*) bits, which beats the prior worst-case bound of Θ(α log *n* + α/ε²) bits. We also consider streaming algorithms parameterized by the number of items with frequency larger than 1. These results help explain why real-world systems often outperform worst-case theoretical guarantees. To our knowledge, this is the first work to establish *C* as a tight complexity measure for distributed distinct element estimation, connecting fine-grained complexity characterization with practical performance.

📝 Abstract
We study the problem of distributed distinct element estimation, where $\alpha$ servers each receive a subset of a universe $[n]$ and aim to compute a $(1+\varepsilon)$-approximation to the number of distinct elements using minimal communication. While prior work establishes a worst-case bound of $\Theta\left(\alpha\log n+\frac{\alpha}{\varepsilon^2}\right)$ bits, these results rely on assumptions that may not hold in practice. We introduce a new parameterization based on the number $C = \frac{\beta}{\varepsilon^2}$ of pairwise collisions, i.e., instances where the same element appears on multiple servers, and design a protocol that uses only $\mathcal{O}\left(\alpha\log n+\frac{\sqrt{\beta}}{\varepsilon^2}\log n\right)$ bits, breaking previous lower bounds when $C$ is small. We further improve our algorithm under assumptions on the number of distinct elements or collisions and provide matching lower bounds in all regimes, establishing $C$ as a tight complexity measure for the problem. Finally, we consider streaming algorithms for distinct element estimation parameterized by the number of items with frequency larger than $1$. Overall, our results offer insight into why statistical problems with known hardness results can be efficiently solved in practice.
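To make the parameterization concrete, the following minimal Python sketch counts pairwise collisions for a toy input and evaluates the two bound expressions from the abstract. It is an illustration, not the paper's protocol. Several things here are assumptions: the counting convention (an element held by k servers contributes k-choose-2 collisions), setting every hidden constant to 1, using base-2 logarithms, and the helper names `pairwise_collisions` and `bound_terms`.

```python
from collections import Counter
from itertools import chain
from math import comb, log2, sqrt


def pairwise_collisions(server_sets):
    """Count pairwise collisions across servers (assumed counting convention)."""
    # Number of servers holding each element; set() removes local duplicates.
    held_by = Counter(chain.from_iterable(set(s) for s in server_sets))
    # Assumption: an element held by k servers yields k-choose-2 collisions.
    return sum(comb(k, 2) for k in held_by.values())


def bound_terms(alpha, n, eps, collisions):
    """Evaluate both bound expressions with all hidden constants set to 1."""
    beta = collisions * eps ** 2  # from C = beta / eps^2
    worst_case = alpha * log2(n) + alpha / eps ** 2
    parameterized = alpha * log2(n) + sqrt(beta) / eps ** 2 * log2(n)
    return worst_case, parameterized


# Toy input: four servers over a small universe (hypothetical data).
servers = [{1, 2, 3}, {3, 4}, {5, 6, 3}, {7, 2}]
C = pairwise_collisions(servers)
worst, param = bound_terms(alpha=len(servers), n=1024, eps=0.1, collisions=C)
print(f"C = {C}, worst-case expression = {worst:.1f}, parameterized expression = {param:.1f}")
```

Ignoring constants, the parameterized expression is smaller than the worst-case one whenever $\sqrt{\beta}\log n < \alpha$, which is the sense in which a small $C$ helps.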
Problem

Research questions and friction points this paper is trying to address.

Distributed distinct element estimation with minimal communication
Improving bounds using pairwise collision parameterization
Streaming algorithms for frequency-based distinct element estimation
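For the streaming setting, the parameter named in the abstract is the number of items with frequency larger than 1. The sketch below computes that quantity exactly for a toy stream, purely to show what is being measured. It stores every item, so it is not a small-space streaming algorithm and not the paper's method; the function name `stream_parameters` is hypothetical.

```python
from collections import Counter


def stream_parameters(stream):
    """Return (distinct count, number of items with frequency > 1), computed exactly."""
    freq = Counter(stream)  # exact frequencies; uses memory linear in the distinct count
    distinct = len(freq)
    repeated = sum(1 for count in freq.values() if count > 1)
    return distinct, repeated


# Toy stream (hypothetical data): "a" and "b" repeat, "c" and "d" do not.
distinct, repeated = stream_parameters(["a", "b", "a", "c", "b", "a", "d"])
print(distinct, repeated)
```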
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parameterization based on pairwise collisions
Protocol with reduced communication bits
Improved algorithm under specific assumptions