🤖 AI Summary
This paper addresses the degradation of model performance in high-dimensional data due to the “curse of dimensionality,” focusing on two fundamental causes: distance concentration and manifold effects. Through rigorous theoretical analysis and systematic numerical experiments across multiple distance metrics—including Minkowski, Chebyshev, and cosine distances—it is the first to decouple and quantify the independent impacts of these phenomena. The study reveals that distance concentration impairs nearest-neighbor search reliability, while manifold effects distort PCA eigenvalue spectra and exacerbate redundant feature representation. Crucially, it demonstrates that the three classical distance metrics converge asymptotically in high dimensions and lose discriminative power. Furthermore, the work elucidates the underlying mechanisms driving performance deterioration in regression, classification, and clustering tasks. These findings establish a unified theoretical foundation for optimizing dimensionality reduction strategies and designing robust, high-dimensional learning algorithms.
📝 Abstract
The characteristics of data, such as distribution and heterogeneity, become more complex and counterintuitive as dimensionality increases. This phenomenon is known as the curse of dimensionality: common patterns and relationships (e.g., internal patterns and boundary patterns) that hold in low-dimensional space may become invalid in higher-dimensional space, degrading the performance of regression, classification, and clustering models and algorithms. The curse of dimensionality can be attributed to many causes. In this paper, we first summarize the potential challenges associated with manipulating high-dimensional data and explain the possible causes of the failure of regression, classification, and clustering tasks. Subsequently, we delve into two major causes of the curse of dimensionality, distance concentration and the manifold effect, through theoretical and empirical analyses. The results demonstrate that, as dimensionality increases, nearest neighbor search (NNS) using three classical distance measures (Minkowski distance, Chebyshev distance, and cosine distance) becomes meaningless. Meanwhile, the data incorporate more redundant features, and the variance contribution of principal component analysis (PCA) becomes skewed towards a few dimensions.
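The distance-concentration effect described above can be observed directly with a small simulation. The sketch below (not from the paper; a minimal illustration under the common assumption of points drawn uniformly from the unit hypercube) computes the relative contrast, (D_max − D_min) / D_min, of Euclidean distances from a random query to a sample of points. As dimensionality grows, this contrast shrinks toward zero, which is why nearest-neighbor search loses meaning:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n_points=1000):
    """Relative contrast (D_max - D_min) / D_min of Euclidean distances
    from one random query to n_points uniform samples in [0, 1]^dim."""
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Contrast collapses as the dimension grows: nearest and farthest
# neighbors become nearly indistinguishable in high dimensions.
for dim in (2, 10, 100, 1000):
    print(f"dim={dim:>4}  relative contrast={relative_contrast(dim):.3f}")
```

The same experiment can be repeated with Chebyshev or cosine distance (e.g., via `scipy.spatial.distance.cdist`) to reproduce the paper's observation that all three measures lose discriminative power at high dimensionality.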