🤖 AI Summary
This paper addresses the lack of rigorous methodological guidance for intrinsic dimension estimation in high-dimensional data. We systematically survey and categorize mainstream algorithms grounded in local affine structure, parametric distributional assumptions, and topological or metric invariants. Within a unified analytical framework and through extensive numerical experiments, we conduct a comprehensive comparative evaluation of maximum likelihood estimation (MLE), PCA-based tangent space estimation, manifold neighborhood methods, and statistical fitting techniques across varying curvature, noise levels, and sample sizes. Results reveal that most methods are highly sensitive to hyperparameters, that hyperparameters tuned on benchmark datasets frequently overfit, and that accuracy, robustness, and generalizability decline sharply in high dimensions. Our core contribution lies in characterizing the applicability boundaries of existing approaches, identifying nonlinear geometric structure and finite-sample effects as primary determinants of estimation reliability. This work provides empirical evidence and guidance for principled algorithm selection and future methodological improvements in intrinsic dimension estimation.
📝 Abstract
It is a standard assumption that high-dimensional datasets have an internal structure, meaning that they in fact lie on, or near, subsets of lower dimension. In many instances it is important to understand the real dimension of the data, and hence the complexity of the dataset at hand. A great variety of dimension estimators have been developed to find the intrinsic dimension of the data, but there is little guidance on how to use these estimators reliably.
This survey reviews a wide range of dimension estimation methods, categorising them by the geometric information they exploit: tangential estimators which detect a local affine structure; parametric estimators which rely on dimension-dependent probability distributions; and estimators which use topological or metric invariants.
The paper evaluates the performance of these methods and investigates how their responses vary with curvature and noise. Key issues addressed include robustness to hyperparameter selection, sample size requirements, accuracy in high dimensions, precision, and performance on non-linear geometries. In identifying the best hyperparameters for benchmark datasets, overfitting is frequent, indicating that many estimators may not generalise well beyond the datasets on which they have been tested.
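As an illustration of two of the estimator families named above, the following is a minimal sketch, not the survey's own implementation. It assumes a Levina-Bickel style maximum likelihood estimator as the parametric example and a local PCA estimator as the tangential example. The function names, the neighbourhood size `k`, the `variance_threshold` of 0.95, and the unit-sphere test data are all hypothetical choices made for this sketch.

```python
import numpy as np


def knn(points, k):
    """Brute-force k nearest neighbours; returns (indices, distances), self excluded."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Column 0 of the sorted order is the point itself (distance 0), so skip it.
    order = np.argsort(dists, axis=1)[:, 1:k + 1]
    return order, np.take_along_axis(dists, order, axis=1)


def mle_dimension(points, k=20):
    """Parametric sketch: Levina-Bickel style MLE, averaged over all points."""
    _, d = knn(points, k)
    # log(T_k / T_j) for j = 1..k-1, where T_j is the distance to the j-th neighbour.
    log_ratios = np.log(d[:, -1:] / d[:, :-1])
    local = (k - 1) / log_ratios.sum(axis=1)
    return float(local.mean())


def local_pca_dimension(points, k=20, variance_threshold=0.95):
    """Tangential sketch: count principal components needed per neighbourhood."""
    idx, _ = knn(points, k)
    estimates = []
    for neighbours in idx:
        patch = points[neighbours]
        patch = patch - patch.mean(axis=0)
        sv = np.linalg.svd(patch, compute_uv=False)
        explained = np.cumsum(sv ** 2) / np.sum(sv ** 2)
        # Smallest number of components whose cumulative variance reaches the threshold.
        estimates.append(int(np.searchsorted(explained, variance_threshold)) + 1)
    return float(np.median(estimates))


def sample_sphere(n, seed=0):
    """Hypothetical test data: n points on the unit 2-sphere embedded in R^3."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, 3))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


if __name__ == "__main__":
    pts = sample_sphere(800)
    # Vary the neighbourhood size to see how each estimate depends on it.
    for k in (10, 20, 40):
        print(k, mle_dimension(pts, k), local_pca_dimension(pts, k))
```

Both estimators depend on the neighbourhood size `k`, and the local PCA sketch additionally depends on the variance threshold. Varying these on a fixed dataset is one simple way to observe the kind of hyperparameter sensitivity the abstract describes.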