AI Summary
High-dimensional data analysis faces fundamental challenges: embedding methods are often selected without empirical justification, performance bounds are poorly characterized, and theoretical debates remain fragmented. This study systematically reviews mainstream dimensionality reduction techniques, including t-SNE, UMAP, PCA, and autoencoders, synthesizing scattered literature and key controversies to propose the first practice-oriented three-part framework for low-dimensional embeddings: "generation-evaluation-application." We conduct a comprehensive empirical evaluation across diverse real-world datasets and downstream tasks, rigorously characterizing each algorithm's trade-offs in preserving local versus global structure, its robustness to noise and hyperparameter variation, and its interpretability. Our analysis establishes clear applicability boundaries and inherent limitations for each method. The resulting best-practice guidelines integrate theoretical rigor with engineering feasibility, providing the field with standardized evaluation protocols and principled criteria for method selection.
Abstract
Large collections of high-dimensional data have become nearly ubiquitous across many academic fields and application domains, ranging from biology to the humanities. Since working directly with high-dimensional data poses challenges, the demand for algorithms that create low-dimensional representations, or embeddings, for data visualization, exploration, and analysis is now greater than ever. In recent years, numerous embedding algorithms have been developed, and their usage has become widespread in research and industry. This surge of interest has resulted in a large and fragmented research field that faces technical challenges alongside fundamental debates, and it has left practitioners without clear guidance on how to effectively employ existing methods. Aiming to increase coherence and facilitate future work, in this review we provide a detailed and critical overview of recent developments, derive a list of best practices for creating and using low-dimensional embeddings, evaluate popular approaches on a variety of datasets, and discuss the remaining challenges and open problems in the field.
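The generate-then-evaluate workflow discussed above can be sketched concretely. The snippet below is a minimal illustration, not the paper's own protocol: it produces two 2-D embeddings of the same data (linear PCA and non-linear t-SNE) with scikit-learn, then scores how well each preserves local neighborhood structure using the standard trustworthiness metric. The dataset, subsample size, and parameter choices are illustrative assumptions.

```python
# Sketch of a "generation -> evaluation" loop for low-dimensional embeddings.
# Assumes scikit-learn is installed; dataset and parameters are illustrative.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

X, _ = load_digits(return_X_y=True)
X = X[:500]  # subsample to keep the example fast

# Generation: two candidate embeddings of the same data.
emb_pca = PCA(n_components=2, random_state=0).fit_transform(X)
emb_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

# Evaluation: trustworthiness in [0, 1] measures how faithfully local
# neighborhoods of the high-dimensional data survive in the embedding.
for name, emb in [("PCA", emb_pca), ("t-SNE", emb_tsne)]:
    score = trustworthiness(X, emb, n_neighbors=10)
    print(f"{name}: trustworthiness = {score:.3f}")
```

In practice one would compare several such metrics (local and global) across methods and hyperparameter settings before committing to an embedding for a downstream task, which is exactly the kind of protocol the review argues should be standardized.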