🤖 AI Summary
To address the space-efficiency bottleneck of cardinality estimation over ultra-large-scale data streams, this paper proposes ExaLogLog, an approximate distinct-counting data structure designed for exa-scale scenarios. The method builds on a probabilistic framework that combines random hashing with a compact register encoding, balancing theoretical soundness with engineering practicality. As both theoretically derived and experimentally verified, ExaLogLog requires 43% less memory than HyperLogLog at the same estimation error, while supporting distinct counts up to the exa-scale (on the order of 10¹⁸ elements). It retains HyperLogLog's production-grade properties: it is commutative, idempotent, mergeable, and reducible, with a constant-time insert operation. This work offers a more space-efficient drop-in alternative for exa-scale real-time cardinality estimation.
📝 Abstract
This work introduces ExaLogLog, a new data structure for approximate distinct counting that has the same practical properties as the popular HyperLogLog algorithm. It is commutative, idempotent, mergeable, and reducible; it has a constant-time insert operation and supports distinct counts up to the exa-scale. At the same time, as theoretically derived and experimentally verified, it requires 43% less space to achieve the same estimation error.
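The listed properties (commutativity, idempotence, mergeability) are shared by the whole family of register-based sketches to which HyperLogLog and ExaLogLog belong: every insert reduces to a maximum over per-register values, and `max` is commutative, idempotent, and associative. The sketch below is a minimal, illustrative HyperLogLog-style structure demonstrating this mechanism; it is not the ExaLogLog encoding itself, and the class and parameter names are hypothetical.

```python
import hashlib

class RegisterSketch:
    """Minimal HyperLogLog-style register sketch (illustrative only,
    not the ExaLogLog encoding). Each insert is a register-wise max,
    which makes the structure commutative, idempotent, and mergeable."""

    def __init__(self, p=12):
        self.p = p                       # 2**p registers
        self.registers = [0] * (1 << p)

    def _hash(self, item):
        # Derive a 64-bit hash: the top p bits select a register, the
        # remaining bits yield the "rank" (leading zeros + 1).
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        return idx, rank

    def add(self, item):
        idx, rank = self._hash(item)
        # max-update: insertion order and duplicates cannot change the state
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Register-wise max merges sketches built on different shards.
        merged = RegisterSketch(self.p)
        merged.registers = [max(a, b)
                            for a, b in zip(self.registers, other.registers)]
        return merged

# Commutativity and idempotence: different insert orders, with a
# duplicate, produce the identical sketch state.
a = RegisterSketch()
a.add("x"); a.add("y")
b = RegisterSketch()
b.add("y"); b.add("x"); b.add("x")
assert a.registers == b.registers
```

Constant-time insertion follows directly: each `add` touches exactly one register after a single hash evaluation.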