Assessing and improving reliability of neighbor embedding methods: a map-continuity perspective

📅 2024-10-22
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Neighbor embedding methods such as t-SNE and UMAP can produce misleading visualization artifacts, including exaggerated cluster separation and spurious local structure. The paper traces these problems to the lack of a data-independent notion of an embedding map. It proposes LOO-map, a framework based on the leave-one-out principle that extends the embedding map from the discrete data points to the entire input space. Within this framework, the authors identify two forms of map discontinuity and develop two point-wise diagnostic scores that detect unreliable embedding points and guide hyperparameter selection. The scores are validated on computer vision and single-cell omics datasets.

📝 Abstract
Visualizing high-dimensional data is essential for understanding biomedical data and deep learning models. Neighbor embedding methods, such as t-SNE and UMAP, are widely used but can introduce misleading visual artifacts. We find that the manifold learning interpretations from many prior works are inaccurate and that the misuse stems from a lack of data-independent notions of embedding maps, which project high-dimensional data into a lower-dimensional space. Leveraging the leave-one-out principle, we introduce LOO-map, a framework that extends embedding maps beyond discrete points to the entire input space. We identify two forms of map discontinuity that distort visualizations: one exaggerates cluster separation and the other creates spurious local structures. As a remedy, we develop two types of point-wise diagnostic scores to detect unreliable embedding points and improve hyperparameter selection, which are validated on datasets from computer vision and single-cell omics.
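The leave-one-out idea in the abstract can be illustrated with a minimal sketch: freeze an existing 2-D embedding and place a single new input by minimizing a t-SNE-style loss over that one point alone. This is not the authors' implementation. The function name `loo_embed`, the fixed Gaussian bandwidth `sigma` (used here in place of perplexity calibration), the per-point KL loss, and the toy two-cluster data are all assumptions made for illustration.

```python
import numpy as np


def loo_embed(x, X, Y, sigma=1.0, lr=0.1, n_steps=500):
    """Place one new input x in a frozen 2-D embedding Y of the inputs X.

    Simplified illustration only: a fixed Gaussian bandwidth `sigma` replaces
    perplexity calibration, and the loss is the KL divergence between the new
    point's input-space and embedding-space affinities.
    """
    # Input-space affinities of x to every existing input (normalized to sum to 1).
    sq_dist = np.sum((X - x) ** 2, axis=1)
    p = np.exp(-(sq_dist - sq_dist.min()) / (2.0 * sigma ** 2))
    p /= p.sum()

    # Initialize at the affinity-weighted average of the frozen embedding.
    y = p @ Y
    for _ in range(n_steps):
        diff = y - Y                                   # shape (n, 2)
        w = 1.0 / (1.0 + np.sum(diff ** 2, axis=1))    # Student-t kernel
        q = w / w.sum()                                # embedding-space affinities
        grad = 2.0 * ((p - q) * w) @ diff              # gradient of KL(p || q) in y
        y = y - lr * grad
    return y


# Toy data (hypothetical): two well-separated clusters and a hand-made embedding.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 5)), rng.normal(3.0, 0.1, (20, 5))])
Y = np.vstack([rng.normal(0.0, 0.5, (20, 2)),
               rng.normal(0.0, 0.5, (20, 2)) + [10.0, 0.0]])

# Embed a new input located at the center of the first cluster.
y_new = loo_embed(X[:20].mean(axis=0), X, Y)
```

Because `loo_embed` accepts any input `x`, it defines a map on the whole input space rather than only on the training points. Evaluating it along a path between the two clusters and watching how the returned coordinates move is one informal way to probe for the abrupt jumps that the abstract calls map discontinuities.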
Problem

Research questions and friction points this paper is trying to address.

Assessing reliability of neighbor embedding methods for high-dimensional data visualization
Identifying misleading visual artifacts in t-SNE and UMAP embeddings
Developing diagnostic scores to detect unreliable embedding points and improve hyperparameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

LOO-map framework extends embedding maps from discrete data points to the entire input space
Detects unreliable points with diagnostic scores
Improves hyperparameter selection for embeddings
Zhexuan Liu
Department of Statistics, University of Wisconsin–Madison, Madison, WI, 53706, USA
Rong Ma
Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA
Yiqiao Zhong
Assistant Professor, University of Wisconsin–Madison
Interpretability of LLMs, Deep Learning Theory, Machine Learning, Statistics