🤖 AI Summary
This paper studies the (1+ε)-approximate distinct count estimation problem in distributed multi-server settings, aiming to minimize total communication cost. Recognizing the gap between pessimistic worst-case lower bounds and the much lower communication observed in practice, the authors introduce the number of pairwise collisions, denoted *C* (instances where the same element appears on multiple servers), as a refined complexity parameter and establish tight communication upper and lower bounds parameterized by *C*. The approach combines randomized hashing, frequency moment estimation, and information-theoretic lower-bound arguments to design a low-communication streaming protocol. Theoretically, when *C* is small, the communication cost improves to *O*(α log *n* + (√β/ε²) log *n*) bits, beating the worst-case Θ(α log *n* + α/ε²) bound. Empirically, this explains why real-world systems often outperform worst-case theoretical guarantees. To the authors' knowledge, this is the first work to identify *C* as the intrinsic hardness measure for distributed distinct counting, unifying fine-grained complexity characterization with practical performance gains.
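To make the ingredients concrete, here is a minimal sketch of a coordinator-based protocol built from a shared hash function: each server sends only a small KMV (k-minimum-values) sketch, and the collision parameter *C* is computed directly as the number of (element, server-pair) co-occurrences. The function names and the choice of KMV as the estimator are illustrative assumptions, not the paper's exact protocol.

```python
import hashlib
from itertools import combinations

def h(x, M=2**32):
    """Shared hash mapping element x to a value in (0, 1]."""
    d = hashlib.sha256(str(x).encode()).digest()
    return (int.from_bytes(d[:8], "big") % M + 1) / M

def server_sketch(items, k):
    """Each server communicates only its k smallest hash values
    (O(k log n) bits), independent of its local set size."""
    return sorted({h(x) for x in items})[:k]

def kmv_estimate(sketches, k):
    """Coordinator merges the sketches; with v_k the k-th smallest
    merged hash, (k - 1) / v_k estimates the global distinct count."""
    merged = sorted(set().union(*map(set, sketches)))[:k]
    if len(merged) < k:
        return len(merged)  # fewer than k distinct hashes: exact count
    return (k - 1) / merged[k - 1]

def pairwise_collisions(server_sets):
    """The parameter C: elements counted once per server pair sharing them."""
    return sum(len(a & b) for a, b in combinations(server_sets, 2))
```

For two servers holding {0, …, 599} and {400, …, 999}, the overlap {400, …, 599} gives C = 200, and the merged KMV sketch estimates the true distinct count of 1000 up to roughly 1/√k relative error.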
📝 Abstract
We study the problem of distributed distinct element estimation, where $\alpha$ servers each receive a subset of a universe $[n]$ and aim to compute a $(1+\varepsilon)$-approximation to the number of distinct elements using minimal communication. While prior work establishes a worst-case bound of $\Theta\left(\alpha\log n+\frac{\alpha}{\varepsilon^2}\right)$ bits, these results rely on assumptions that may not hold in practice. We introduce a new parameterization based on the number $C = \frac{\beta}{\varepsilon^2}$ of pairwise collisions, i.e., instances where the same element appears on multiple servers, and design a protocol that uses only $\mathcal{O}\left(\alpha\log n+\frac{\sqrt{\beta}}{\varepsilon^2}\log n\right)$ bits, breaking previous lower bounds when $C$ is small. We further improve our algorithm under assumptions on the number of distinct elements or collisions and provide matching lower bounds in all regimes, establishing $C$ as a tight complexity measure for the problem. Finally, we consider streaming algorithms for distinct element estimation parameterized by the number of items with frequency larger than $1$. Overall, our results offer insight into why statistical problems with known hardness results can be efficiently solved in practice.
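To see when the collision-aware bound wins, one can compare the two expressions directly (constants dropped, and with $\beta = C\varepsilon^2$ per the abstract's parameterization); the parameter values below are purely illustrative.

```python
from math import log2, sqrt

def worst_case_bits(alpha, n, eps):
    """Prior worst-case bound: Theta(alpha * log n + alpha / eps^2)."""
    return alpha * log2(n) + alpha / eps**2

def collision_aware_bits(alpha, beta, n, eps):
    """New bound: O(alpha * log n + (sqrt(beta) / eps^2) * log n)."""
    return alpha * log2(n) + (sqrt(beta) / eps**2) * log2(n)
```

For example, with $\alpha = 100$ servers, $n = 2^{32}$, and $\varepsilon = 0.01$, the new bound is cheaper whenever $\sqrt{\beta}\log n < \alpha$, i.e., when collisions are sufficiently rare.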