🤖 AI Summary
Existing dimensionality reduction methods such as UMAP and t-SNE rely on local-neighborhood objectives that can preserve sampling noise while distorting global topology, often producing spurious loops or isolated clusters. This work introduces a topology-faithfulness benchmark built on noisy manifolds with known homology and tunes DiRe, a scalable dimensionality reduction algorithm, against it. The resulting Pareto-optimal configurations match or beat GPU-accelerated UMAP on classification while recovering exact first Betti numbers on stress tests. On embeddings of 723,000 arXiv papers, DiRe preserves three to four times more topological structure than UMAP at comparable wall-clock time.
📝 Abstract
Dimensionality reduction methods such as UMAP and t-SNE are central tools for visualising high-dimensional data, but their local-neighborhood objectives can preserve sampling noise while distorting global topology. We show that standard local metrics reward this noise memorisation: top-performing embeddings invent cycles and disconnected islands absent from the data. We introduce a topology-faithfulness benchmark based on noisy manifolds with known homology, tune DiRe against it, and find Pareto-optimal configurations that match or beat GPU-accelerated UMAP on classification while recovering exact first Betti numbers on stress tests. On 723K arXiv paper embeddings, DiRe preserves 3-4 times more topological structure than UMAP at comparable wall-clock time.
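To make "known homology" concrete, the sketch below samples a noisy circle, whose first Betti number is 1, and counts connected components and independent loops of a point cloud at a single distance scale. It is only an illustration of the kind of check such a benchmark implies, not the paper's benchmark or DiRe itself: the names `noisy_circle`, `gf2_rank`, and `betti_numbers` are hypothetical, and the fixed-scale Vietoris-Rips complex over GF(2) is a simplifying assumption standing in for a full persistent-homology computation.

```python
import math
import random
from itertools import combinations


def noisy_circle(n, noise, seed=0):
    """Sample n points on the unit circle with Gaussian jitter (hypothetical helper)."""
    rng = random.Random(seed)
    points = []
    for i in range(n):
        theta = 2 * math.pi * i / n
        points.append((math.cos(theta) + rng.gauss(0, noise),
                       math.sin(theta) + rng.gauss(0, noise)))
    return points


def gf2_rank(rows):
    """Rank over GF(2) of a matrix whose rows are given as integer bitmasks."""
    pivots = {}  # highest set bit -> reduced row that owns this pivot
    for row in rows:
        while row:
            top = row.bit_length() - 1
            if top not in pivots:
                pivots[top] = row
                break
            row ^= pivots[top]  # eliminate the leading bit and keep reducing
    return len(pivots)


def betti_numbers(points, eps):
    """Return (b0, b1) of the Vietoris-Rips complex at scale eps, over GF(2).

    Assumption: one fixed scale, simplices up to dimension 2 only.
    """
    n = len(points)
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if math.dist(points[i], points[j]) <= eps]
    edge_id = {edge: k for k, edge in enumerate(edges)}
    # A triangle is included when all three of its edges are present.
    triangles = [(i, j, k) for i, j, k in combinations(range(n), 3)
                 if (i, j) in edge_id and (i, k) in edge_id and (j, k) in edge_id]

    # Boundary maps stored one simplex per bitmask (rank equals that of the transpose).
    d1 = [(1 << i) | (1 << j) for i, j in edges]
    d2 = [(1 << edge_id[(i, j)]) | (1 << edge_id[(i, k)]) | (1 << edge_id[(j, k)])
          for i, j, k in triangles]

    rank_d1 = gf2_rank(d1)
    rank_d2 = gf2_rank(d2)
    b0 = n - rank_d1                      # connected components
    b1 = len(edges) - rank_d1 - rank_d2   # independent loops not filled by triangles
    return b0, b1


if __name__ == "__main__":
    cloud = noisy_circle(30, noise=0.02)
    print(betti_numbers(cloud, eps=0.5))
```

Under these assumptions, an embedding could be checked by running the same count on its low-dimensional output and comparing against the known value for the source manifold. An invented cycle would show up as a larger `b1` and a disconnected island as a larger `b0`. The choice of `eps` matters: too small a scale leaves the points disconnected, which is why practical pipelines usually track features across a range of scales instead.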