🤖 AI Summary
This work addresses the failure of traditional hash-based deduplication methods in high-dimensional noisy data streams, where repeated observations of the same object are only approximately similar. It introduces MaxSketch, which leverages the geometric structure of learned representations for streaming distinct counting. MaxSketch is a max-linear sketch built from random Gaussian projections. Under this geometric structure, the method circumvents the Θ̃(√n) memory bound that is tight for general metric spaces, achieving a (1+ε)-approximation with only Õ(log n/ε²) memory. Theoretical analysis establishes this guarantee, and experiments on image streams show that MaxSketch estimates distinct counts accurately and generalizes beyond the training regime.
📝 Abstract
Estimating the number of distinct elements in a data stream is well understood when repeated elements are identical. In modern settings, however, observations are high-dimensional and noisy, so repeated instances of the same object are only approximately similar -- for example, different images of the same individual may vary significantly at the pixel level. Classical sketches such as HyperLogLog rely on consistent hash values for identical elements and break down in this regime. Recent work on robust distinct counting in general metric spaces achieves $\widetilde{\Theta}(\sqrt{n})$ memory, which is tight in the worst case. We show that substantially improved memory guarantees are possible under geometric structure common in learned representations. We introduce MaxSketch, a simple max-linear sketch built from random Gaussian projections, and prove that it succeeds in estimating the number of distinct latent objects. Concretely, we show that under this geometric structure $m = \widetilde{O}(\log n / \varepsilon^2)$ random projections (and hence $\widetilde{O}(\log n / \varepsilon^2)$ memory) suffice to recover the true distinct count within a $(1+\varepsilon)$ factor. Experiments on image streams confirm that MaxSketch accurately estimates distinct counts and generalizes beyond the training regime. Our results bridge classical streaming algorithms and modern representation learning, showing how geometric structure can fundamentally reduce the complexity of distinct counting.
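To make the phrase "max-linear sketch built from random Gaussian projections" concrete, the following is a minimal illustrative sketch in Python with NumPy, not the authors' implementation. The class name `MaxLinearSketch`, the parameters `dim`, `m`, and `seed`, and the toy data are all assumptions introduced here. The abstract does not describe how a count is read off the sketch, so no estimator is shown; the code only illustrates the state that would be kept in memory and why it is insensitive to repeated observations.

```python
import numpy as np


class MaxLinearSketch:
    """Hypothetical max-linear sketch: coordinate-wise max of Gaussian projections."""

    def __init__(self, dim, m, seed=0):
        rng = np.random.default_rng(seed)
        # m random Gaussian projection directions (assumed i.i.d. standard normal).
        self.G = rng.standard_normal((m, dim))
        # The streaming state is just m running maxima.
        self.state = np.full(m, -np.inf)

    def update(self, x):
        # Project the observation and keep the coordinate-wise maximum.
        np.maximum(self.state, self.G @ x, out=self.state)

    def merge(self, other):
        # Sketches built with the same projections combine by taking maxima.
        np.maximum(self.state, other.state, out=self.state)


# Toy data (assumption): k latent objects as unit vectors in R^dim.
rng = np.random.default_rng(1)
dim, m, k = 64, 32, 5
objects = rng.standard_normal((k, dim))
objects /= np.linalg.norm(objects, axis=1, keepdims=True)

# Each object seen once.
once = MaxLinearSketch(dim, m)
for x in objects:
    once.update(x)

# Each object seen ten times, exactly repeated.
repeated = MaxLinearSketch(dim, m)
for x in np.repeat(objects, 10, axis=0):
    repeated.update(x)

# Each object seen ten times with small additive noise.
noisy = MaxLinearSketch(dim, m)
for x in objects:
    for _ in range(10):
        noisy.update(x + 1e-3 * rng.standard_normal(dim))

# Two halves of the stream sketched separately, then merged.
left, right = MaxLinearSketch(dim, m), MaxLinearSketch(dim, m)
for x in objects[:2]:
    left.update(x)
for x in objects[2:]:
    right.update(x)
left.merge(right)

print(np.allclose(once.state, repeated.state))    # exact repeats leave the sketch unchanged
print(np.max(np.abs(once.state - noisy.state)))   # small shift under small noise
print(np.allclose(once.state, left.state))        # merged sketch matches the single pass
```

Because the maximum is idempotent, exact repeats never change the state, and small perturbations shift each coordinate only slightly. This is the property that hash-based sketches such as HyperLogLog lack when repeated observations are not bit-identical.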