🤖 AI Summary
Existing nonlinear dimensionality reduction methods, such as t-SNE and UMAP, lack out-of-sample mapping and inverse transformation capabilities, which hinders quantitative validation. This work proposes a constrained autoencoder framework that distills a static manifold embedding into a generalizable encoder–decoder model, providing, for the first time, both a forward map and an approximate inverse for arbitrary manifold embeddings. The approach introduces a distortion metric based on reconstruction error, enabling principled hyperparameter tuning and method comparison. Evaluated across multiple benchmark and scientific datasets, the model supports held-out selection of embeddings and hyperparameters, uncovers complex biological structures obscured in two-dimensional projections, and detects distribution shift.
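The distillation idea can be illustrated with a small NumPy sketch. It is not the authors' implementation, and several pieces are assumptions made for illustration: the teacher embedding is a PCA projection standing in for fitted t-SNE or UMAP coordinates, the encoder and decoder are linear least-squares maps standing in for neural networks, and the names `encode`, `decode`, and `distortion` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): 300 points in 10-D lying near a 2-D linear subspace.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 10))
X_train, X_test = X[:200], X[200:]

# Stand-in teacher embedding (assumption): top-2 PCA scores of the training set.
# In a real workflow this would be the fitted t-SNE or UMAP coordinates.
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
Z_teacher = (X_train - mean) @ Vt[:2].T


def add_bias(A):
    """Append a column of ones so the linear maps include an intercept."""
    return np.hstack([A, np.ones((A.shape[0], 1))])


# Encoder: fit inputs -> teacher coordinates, so the bottleneck is tied to the teacher.
W_enc, *_ = np.linalg.lstsq(add_bias(X_train), Z_teacher, rcond=None)
# Decoder: fit teacher coordinates -> inputs, giving an approximate inverse.
W_dec, *_ = np.linalg.lstsq(add_bias(Z_teacher), X_train, rcond=None)


def encode(A):
    """Out-of-sample map into the teacher's embedding space."""
    return add_bias(A) @ W_enc


def decode(Z):
    """Approximate inverse from the embedding back to feature space."""
    return add_bias(Z) @ W_dec


def distortion(A):
    """Pointwise squared reconstruction error after a round trip."""
    return np.sum((A - decode(encode(A))) ** 2, axis=1)


# How closely the encoder reproduces the teacher on the training set.
train_fit_gap = np.abs(encode(X_train) - Z_teacher).max()

# Held-out distortion, and distortion for an artificially shifted sample.
heldout = distortion(X_test)
shifted = distortion(X_test + 3.0 * rng.normal(size=X_test.shape))

print("max encoder/teacher gap on train:", train_fit_gap)
print("mean held-out distortion:", heldout.mean())
print("mean distortion under shift:", shifted.mean())
```

In this toy setting the held-out distortion acts as a validation score for the embedding, and the much larger distortion on the perturbed sample shows how the same quantity could flag distribution shift. With a nonlinear teacher, the linear maps here would need to be replaced by a more flexible encoder and decoder.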
📝 Abstract
Low-dimensional embeddings are widely used as visual summaries of high-dimensional data and to enable downstream scientific discoveries. Yet popular nonlinear dimension reduction methods, such as t-SNE and UMAP, are often selected on visual appeal alone, without rigorous quantitative validation. A major reason is that manifold embeddings typically provide neither an out-of-sample map nor an inverse back to the original feature space; this makes held-out validation, the gold standard in supervised learning, all but impossible. To address these challenges, we develop a novel framework, MEDAL (Manifold Embedding Distillation via Autoencoder Learning), which distills a fitted manifold embedding into a reusable encoder–decoder model. MEDAL trains a constrained autoencoder whose bottleneck exactly matches any teacher embedding while the decoder reconstructs the original input; this yields an explicit map for new samples, an approximate inverse, and a pointwise reconstruction-based measure of distortion in the manifold space. This converts static manifold embeddings into models that can be evaluated on held-out data, enabling quantitative validation, including comparison of different dimension reduction methods and hyperparameter tuning. Across multiple benchmark and scientific case studies, we show that MEDAL enables held-out validation to determine optimal manifold embeddings and hyperparameters, reveals biologically coherent regions that are difficult to preserve in two-dimensional embeddings, and detects distribution shift when new samples are mapped into a fixed reference manifold. MEDAL provides a general validation wrapper for any existing dimension reduction technique that will improve the rigor and