🤖 AI Summary
This paper studies the (1+ε)-approximate distinct count estimation problem in distributed multi-server settings, aiming to minimize total communication cost. Recognizing the gap between pessimistic worst-case lower bounds and the much lower communication observed in practice, the authors introduce the number of pairwise collisions, denoted *C* (instances where the same element appears on multiple servers), as a refined complexity parameter and establish tight communication upper and lower bounds parameterized by *C*. The approach combines randomized hashing, frequency moment estimation, and information-theoretic lower-bound arguments to design a low-communication streaming protocol. Theoretically, when *C* is small, the communication cost improves to *O*(α log *n* + (√β/ε²) log *n*) bits, beating the worst-case Θ(α log *n* + α/ε²) bound. Empirically, this explains why real-world systems often outperform worst-case theoretical guarantees. To the authors' knowledge, this is the first work to identify *C* as the intrinsic hardness measure for distributed distinct counting, unifying fine-grained complexity characterization with practical performance gains.
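To make the ingredients concrete, here is a minimal sketch of a coordinator-based protocol built from a shared hash function: each server sends only a small KMV (k-minimum-values) sketch, and the collision parameter *C* is computed directly as the number of (element, server-pair) co-occurrences. The function names and the choice of KMV as the estimator are illustrative assumptions, not the paper's exact protocol.

```python
import hashlib
from itertools import combinations

def h(x, M=2**32):
    """Shared hash mapping element x to a value in (0, 1]."""
    d = hashlib.sha256(str(x).encode()).digest()
    return (int.from_bytes(d[:8], "big") % M + 1) / M

def server_sketch(items, k):
    """Each server communicates only its k smallest hash values
    (O(k log n) bits), independent of its local set size."""
    return sorted({h(x) for x in items})[:k]

def kmv_estimate(sketches, k):
    """Coordinator merges the sketches; with v_k the k-th smallest
    merged hash, (k - 1) / v_k estimates the global distinct count."""
    merged = sorted(set().union(*map(set, sketches)))[:k]
    if len(merged) < k:
        return len(merged)  # fewer than k distinct hashes: exact count
    return (k - 1) / merged[k - 1]

def pairwise_collisions(server_sets):
    """The parameter C: elements counted once per server pair sharing them."""
    return sum(len(a & b) for a, b in combinations(server_sets, 2))
```

For two servers holding {0, …, 599} and {400, …, 999}, the overlap {400, …, 599} gives C = 200, and the merged KMV sketch estimates the true distinct count of 1000 up to roughly 1/√k relative error.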
📝 Abstract
We study the problem of distributed distinct element estimation, where $\alpha$ servers each receive a subset of a universe $[n]$ and aim to compute a $(1+\varepsilon)$-approximation to the number of distinct elements using minimal communication. While prior work establishes a worst-case bound of $\Theta\left(\alpha\log n+\frac{\alpha}{\varepsilon^2}\right)$ bits, these results rely on assumptions that may not hold in practice. We introduce a new parameterization based on the number $C = \frac{\beta}{\varepsilon^2}$ of pairwise collisions, i.e., instances where the same element appears on multiple servers, and design a protocol that uses only $\mathcal{O}\left(\alpha\log n+\frac{\sqrt{\beta}}{\varepsilon^2}\log n\right)$ bits, breaking previous lower bounds when $C$ is small. We further improve our algorithm under assumptions on the number of distinct elements or collisions and provide matching lower bounds in all regimes, establishing $C$ as a tight complexity measure for the problem. Finally, we consider streaming algorithms for distinct element estimation parameterized by the number of items with frequency larger than $1$. Overall, our results offer insight into why statistical problems with known hardness results can be efficiently solved in practice.
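To see when the collision-aware bound wins, one can compare the two expressions directly (constants dropped, and with $\beta = C\varepsilon^2$ per the abstract's parameterization); the parameter values below are purely illustrative.

```python
from math import log2, sqrt

def worst_case_bits(alpha, n, eps):
    """Prior worst-case bound: Theta(alpha * log n + alpha / eps^2)."""
    return alpha * log2(n) + alpha / eps**2

def collision_aware_bits(alpha, beta, n, eps):
    """New bound: O(alpha * log n + (sqrt(beta) / eps^2) * log n)."""
    return alpha * log2(n) + (sqrt(beta) / eps**2) * log2(n)
```

For example, with $\alpha = 100$ servers, $n = 2^{32}$, and $\varepsilon = 0.01$, the new bound is cheaper whenever $\sqrt{\beta}\log n < \alpha$, i.e., when collisions are sufficiently rare.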