🤖 AI Summary
Existing manifold learning methods often distort cluster structure when reducing the dimensionality of high-dimensional data, and they scale poorly to large datasets. To address these limitations, we propose a collaborative framework based on uniform landmark sampling. First, a low-dimensional skeleton of the manifold is constructed from representative landmarks chosen by theoretically grounded uniform sampling. Second, non-landmark points are embedded into this space via constrained locally linear embedding (CLLE), which keeps the global structure consistent while greatly improving computational scalability. The method is robust, yielding stable embeddings even at low sampling rates, and it generalizes across diverse domains. Extensive evaluation on synthetic benchmarks and real-world applications, including single-cell transcriptomics and ECG-based anomaly detection, demonstrates its effectiveness. Moreover, it maintains superior scalability and structural fidelity as dataset size and embedding dimension increase.
📝 Abstract
As a pivotal approach in machine learning and data science, manifold learning aims to uncover the intrinsic low-dimensional structure of complex nonlinear manifolds in high-dimensional space. By exploiting the manifold hypothesis, various nonlinear dimension reduction techniques have been developed to facilitate visualization, classification, clustering, and the extraction of key insights. Although existing manifold learning methods have achieved remarkable success, they still suffer from extensive distortion of the global structure, which hinders the understanding of underlying patterns. Scalability issues also limit their applicability to large-scale data. Here, we propose a scalable manifold learning (scML) method that can handle large-scale, high-dimensional data efficiently. It starts by seeking a set of landmarks to construct a low-dimensional skeleton of the entire dataset, and then incorporates the non-landmarks into the learned space using constrained locally linear embedding (CLLE). We empirically validated the effectiveness of scML on synthetic datasets and real-world benchmarks of different types, and applied it to analyze single-cell transcriptomics data and to detect anomalies in electrocardiogram (ECG) signals. scML scales well with increasing data size and embedding dimension, and exhibits promising performance in preserving the global structure. The experiments also demonstrate notable robustness in embedding quality as the sampling rate decreases.
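The two-stage idea described above (embed a landmark skeleton first, then place the remaining points relative to it) can be illustrated with a minimal sketch. This is not the scML implementation, and several pieces are stand-ins chosen for brevity: greedy farthest-point sampling replaces the paper's uniform landmark sampling, a PCA projection replaces the skeleton embedding, and plain sum-to-one LLE reconstruction weights replace CLLE, whose exact constraints the abstract does not spell out. All function names and parameters (`m`, `d`, `k`, `reg`) are hypothetical.

```python
import numpy as np


def farthest_point_landmarks(X, m, seed=0):
    """Pick m spread-out landmark indices by greedy farthest-point sampling.

    Assumption: a stand-in for the paper's uniform landmark sampling.
    """
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(X)))]
    dist = np.linalg.norm(X - X[idx[0]], axis=1)
    for _ in range(m - 1):
        nxt = int(np.argmax(dist))  # point farthest from all chosen landmarks
        idx.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(idx)


def pca_embed(X, d):
    """Placeholder skeleton embedding: project onto the top-d principal axes."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:d].T


def lle_weights(x, neighbors, reg=1e-3):
    """Weights w minimizing ||x - sum_j w_j n_j||^2 subject to sum(w) = 1."""
    Z = neighbors - x                       # neighbors centered on x
    G = Z @ Z.T                             # local Gram matrix
    G = G + reg * (np.trace(G) + 1e-12) * np.eye(len(G))  # regularize
    w = np.linalg.solve(G, np.ones(len(G)))
    return w / w.sum()


def landmark_embed(X, m=50, d=2, k=8, reg=1e-3, seed=0):
    """Embed landmarks first, then place every other point among them."""
    lm = farthest_point_landmarks(X, m, seed)
    Y = np.zeros((len(X), d))
    Y[lm] = pca_embed(X[lm], d)             # stage 1: landmark skeleton
    rest = np.setdiff1d(np.arange(len(X)), lm)
    for i in rest:                          # stage 2: non-landmark points
        dists = np.linalg.norm(X[lm] - X[i], axis=1)
        nn = np.argsort(dists)[:k]          # k nearest landmarks
        w = lle_weights(X[i], X[lm[nn]], reg)
        Y[i] = w @ Y[lm[nn]]                # same weights, low-dim positions
    return Y, lm


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 20))
    Y, lm = landmark_embed(X, m=60, d=2, k=8)
    print(Y.shape, len(lm))
```

In this sketch the cost of the second stage grows with the number of points times the number of landmarks, rather than with the square of the number of points, which is the kind of saving a landmark-based scheme is meant to provide. How well global structure is preserved depends on the skeleton embedding, so the PCA placeholder should not be read as representative of scML's results.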