🤖 AI Summary
Computing the exact Hausdorff distance (HD) on large-scale, high-dimensional data is computationally prohibitive, which limits its practical use. To address this, we propose ProHD, a projection-based algorithm that combines centroid-aligned axes with principal component analysis (PCA) to select informative projection directions. Projecting onto these directions identifies a small subset of extremal points, on which an approximate HD is computed. ProHD comes with theoretical guarantees: its estimate never exceeds the true HD, and the additive approximation error is bounded. On datasets with up to 2 million points in 256 dimensions, ProHD runs 10–100× faster than exact algorithms and attains 5–20× lower estimation error than random sampling. Its lightweight design supports integration into vector database retrieval systems and real-time processing of streaming data, making HD computation practical at this scale.
📝 Abstract
The Hausdorff distance (HD) is a robust measure of set dissimilarity, but computing it exactly on large, high-dimensional datasets is prohibitively expensive. We propose \textbf{ProHD}, a projection-guided approximation algorithm that dramatically accelerates HD computation while maintaining high accuracy. ProHD identifies a small subset of candidate "extreme" points by projecting the data onto a few informative directions (such as the centroid axis and top principal components), then computes the HD on this subset. This approach guarantees an underestimate of the true HD with a bounded additive error and typically achieves results within a few percent of the exact value. In extensive experiments on image, physics, and synthetic datasets (up to two million points in $D=256$), ProHD runs 10--100$\times$ faster than exact algorithms while attaining 5--20$\times$ lower error than random sampling-based approximations. Our method enables practical HD calculations in scenarios like large vector databases and streaming data, where quick and reliable set distance estimation is needed.
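The projection-guided idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `frac` and `n_pcs` parameters, and the choice to run PCA on the pooled, centered data are all assumptions made for illustration. With the standard definition $H(A,B)=\max\{h(A,B),h(B,A)\}$, where $h(A,B)=\max_{a\in A}\min_{b\in B}\|a-b\|$, the sketch also assumes that candidate points are queried against the full opposite set; restricting only the outer maximum in this way is one simple route to an estimate that cannot exceed the exact value.

```python
import numpy as np


def directed_hd(X, Y, chunk=256):
    """Directed Hausdorff distance: max over x in X of the distance to its nearest y in Y."""
    worst = 0.0
    for i in range(0, len(X), chunk):
        block = X[i:i + chunk]
        # Squared distances between every point in the block and every point in Y.
        d2 = ((block[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        worst = max(worst, float(np.sqrt(d2.min(axis=1).max())))
    return worst


def projection_directions(A, B, n_pcs=2):
    """Centroid axis plus the top principal components of the pooled data (assumed choice)."""
    dirs = []
    axis = B.mean(axis=0) - A.mean(axis=0)
    norm = np.linalg.norm(axis)
    if norm > 0:
        dirs.append(axis / norm)
    pooled = np.vstack([A, B])
    pooled = pooled - pooled.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(pooled, full_matrices=False)
    dirs.extend(Vt[:n_pcs])
    return np.array(dirs)


def extreme_indices(X, dirs, frac=0.05):
    """Indices of points with the smallest and largest projections on each direction."""
    k = max(1, int(frac * len(X)))
    proj = X @ dirs.T  # shape: (n_points, n_directions)
    picked = set()
    for j in range(proj.shape[1]):
        order = np.argsort(proj[:, j])
        picked.update(order[:k].tolist())
        picked.update(order[-k:].tolist())
    return np.array(sorted(picked))


def prohd_sketch(A, B, n_pcs=2, frac=0.05):
    """Lower-bound HD estimate from projection-selected candidate points (illustrative only)."""
    dirs = projection_directions(A, B, n_pcs)
    A_sel = A[extreme_indices(A, dirs, frac)]
    B_sel = B[extreme_indices(B, dirs, frac)]
    # Candidates are compared against the full opposite set, so each term
    # is a maximum over a subset of the points used by the exact distance.
    return max(directed_hd(A_sel, B), directed_hd(B_sel, A))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(2000, 16))
    B = rng.normal(size=(2000, 16)) + 0.5
    exact = max(directed_hd(A, B), directed_hd(B, A))
    print("estimate:", prohd_sketch(A, B), "exact:", exact)
```

In this sketch the cost drops because only the selected candidates, rather than every point, need a nearest-neighbour search. How close the estimate comes to the exact value depends on `frac`, the number of directions, and the data, and the brute-force `directed_hd` here would need an index structure or batching to scale to millions of points.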