🤖 AI Summary
This work addresses the computational bottleneck that arises when nonlinear dimensionality reduction is rerun repeatedly during high-dimensional data exploration. It proposes FastUMAP, a landmark-based method that constructs a sparse point-to-landmark fuzzy graph, computes a Nyström spectral initialization, and refines the embedding with UMAP-style optimization on the bipartite graph. A tunable landmark ratio \( r = m/n \) offers an explicit trade-off between runtime and embedding fidelity. In the reported comparison, the method has the lowest runtime on seven of nine benchmark datasets; on MNIST and Fashion-MNIST (\( n=70{,}000 \)) it runs in about 4.6 seconds, versus about 73–75 seconds for Barnes–Hut t-SNE, while reaching 91.4% mean kNN accuracy compared with 94.6% for the strongest accuracy baseline.
📝 Abstract
Exploratory analysis of high-dimensional data rarely stops at a single embedding. In practice, analysts rerun dimensionality reduction after changing preprocessing, subsets, or hyperparameters, and standard nonlinear methods can quickly become the bottleneck. We introduce FastUMAP (Bipartite Manifold Approximation and Projection), a landmark-based method designed for this repeated-use setting. FastUMAP builds a sparse point-landmark fuzzy graph, computes a Nyström spectral warm start from the induced landmark affinity, and then refines all sample coordinates with a UMAP-style objective on the bipartite graph. The landmark ratio r = m/n, where m is the number of landmarks and n the number of samples, provides a direct way to trade runtime against fidelity. On 9 benchmark datasets spanning 178 to 70,000 samples, FastUMAP has the lowest runtime on 7 datasets in our comparison, which uses default implementations on a single workstation. On MNIST and Fashion-MNIST (n = 70,000), it runs in about 4.6 seconds, compared with about 73–75 seconds for Barnes–Hut t-SNE, while reaching 91.4% mean kNN accuracy versus 94.6% for the strongest accuracy baseline. FastUMAP is therefore best viewed as a fast option for repeated exploratory embedding, rather than as a replacement for accuracy-first methods.
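To make the first two stages of the pipeline concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes uniform random landmark sampling, a simplified exponential kernel in place of UMAP's calibrated fuzzy memberships, and a dense point-landmark matrix for readability; the function name `bipartite_spectral_init` and its parameters are hypothetical. The UMAP-style refinement stage is omitted.

```python
import numpy as np

def bipartite_spectral_init(X, ratio=0.1, k=5, dim=2, seed=0):
    """Illustrative warm start: landmarks -> point-landmark graph -> spectral coords."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Landmark count m from the ratio r = m/n (kept large enough for `dim` components).
    m = min(n, max(dim + 2, int(round(ratio * n))))
    landmarks = rng.choice(n, size=m, replace=False)  # assumption: uniform sampling
    L = X[landmarks]
    k = min(k, m)

    # Squared distances from every point to every landmark (n x m).
    d2 = (X ** 2).sum(1)[:, None] + (L ** 2).sum(1)[None, :] - 2.0 * X @ L.T
    d2 = np.maximum(d2, 0.0)

    # Keep only the k nearest landmarks of each point.
    nn = np.argpartition(d2, k - 1, axis=1)[:, :k]
    rows = np.arange(n)[:, None]
    d = np.sqrt(d2[rows, nn])

    # Simplified fuzzy weights: distance beyond the nearest landmark, scaled by a
    # per-point bandwidth (assumption: UMAP itself calibrates this by binary search).
    rho = d.min(axis=1, keepdims=True)
    sigma = d.mean(axis=1, keepdims=True) - rho + 1e-12
    w = np.exp(-(d - rho) / sigma)

    B = np.zeros((n, m))          # dense here for clarity; sparse in practice
    B[rows, nn] = w

    # Degree-normalise the bipartite graph.
    d_pts = B.sum(axis=1)                       # > 0: the nearest landmark has weight 1
    d_lmk = np.maximum(B.sum(axis=0), 1e-12)    # guard against unused landmarks
    Bn = B / np.sqrt(d_pts)[:, None] / np.sqrt(d_lmk)[None, :]

    # Eigendecompose the induced m x m landmark affinity.
    vals, vecs = np.linalg.eigh(Bn.T @ Bn)      # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][: dim + 1]
    top = top[1:]                               # drop the trivial leading eigenvector
    s = np.sqrt(np.clip(vals[top], 1e-12, None))

    # Nystrom-style extension from landmarks to all n points.
    U = Bn @ vecs[:, top] / s
    return U / np.sqrt(d_pts)[:, None]
```

Under these assumptions the only eigendecomposition is of an m × m matrix, so lowering `ratio` shrinks the dominant cost at the expense of a coarser landmark graph, which is the runtime/fidelity trade-off the abstract describes. A call such as `Y0 = bipartite_spectral_init(X, ratio=0.05)` would return initial coordinates to hand to a refinement step.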