🤖 AI Summary
This paper investigates the theoretical limits of preserving neighborhood structure in low-dimensional visualizations of high-dimensional data. Addressing the fundamental question, “Can neighborhood relations of high-dimensional data be reliably preserved in constant-dimensional spaces (e.g., 2D/3D)?”, we introduce the *doubling dimension* as a geometric complexity measure for embedding difficulty. Leveraging graph embedding theory, metric space analysis, and planted cluster models, we systematically characterize visualization compressibility across graph classes. We prove: (i) almost all $n$-vertex graphs require $\Omega(\log n)$ dimensions to maintain neighborhood separability; (ii) sparse regular graphs still necessitate $\Omega(\log n / \log\log n)$ dimensions; and (iii) in normed spaces, nearly all graphs require $\Theta(n)$ dimensions. This work provides the first information-theoretic and geometric characterization of intrinsic dimensional bottlenecks in common dimensionality reduction techniques (e.g., t-SNE, UMAP), establishing rigorous theoretical foundations for visualization design and interpretation.
📝 Abstract
To what extent is it possible to visualize high-dimensional datasets in a two- or three-dimensional space? We reframe this question in terms of embedding $n$-vertex graphs (representing the neighborhood structure of the input points) into metric spaces of low doubling dimension $d$, in a way that maintains the separation between neighbors and non-neighbors. This seemingly lax embedding requirement is surprisingly difficult to satisfy. Our investigation shows that an overwhelming fraction of graphs require $d = \Omega(\log n)$. Even when considering sparse regular graphs, the situation does not improve, as an overwhelming fraction of such graphs require $d = \Omega(\log n / \log\log n)$. The landscape changes dramatically when embedding into normed spaces. In particular, all but a vanishing fraction of graphs demand $d = \Theta(n)$. Finally, we study the implications of these results for visualizing data with intrinsic cluster structure. We find that graphs produced from a planted partition model with $k$ clusters on $n$ points typically require $d = \Omega(\log n)$, even when the cluster structure is salient. These results challenge the aspiration that constant-dimensional visualizations can faithfully preserve neighborhood structure.
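To make the embedding requirement concrete, the following is a minimal Python sketch of one possible reading of "maintains the separation between neighbors and non-neighbors": for every vertex, each neighbor must lie strictly closer in the embedding than every non-neighbor. This per-vertex criterion, the function name `separates_neighbors`, and the toy path graph are illustrative assumptions, not the paper's formal definition, which may differ (for example, by using a single global distance threshold).

```python
import math

def separates_neighbors(adj, pos):
    """Check an assumed per-vertex separation criterion.

    adj: dict mapping each vertex to the set of its neighbors.
    pos: dict mapping each vertex to its coordinates in the embedding.
    Returns True if, for every vertex v, all neighbors of v are strictly
    closer to v than all non-neighbors of v.
    """
    for v, nbrs in adj.items():
        near = [math.dist(pos[v], pos[u]) for u in nbrs]
        far = [math.dist(pos[v], pos[w]) for w in adj if w != v and w not in nbrs]
        # Vertices with no neighbors or no non-neighbors impose no constraint.
        if near and far and max(near) >= min(far):
            return False
    return True

# Hypothetical toy input: the path a - b - c.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}

# Laid out along a line, neighbors are closer than non-neighbors.
line = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (2.0, 0.0)}

# Folding c back next to a puts a non-neighbor closer than a neighbor.
folded = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (0.1, 0.0)}

print(separates_neighbors(path, line))
print(separates_neighbors(path, folded))
```

A check of this kind only tests a given layout. The lower bounds stated in the abstract concern whether any such layout exists in a space of low doubling dimension, which this sketch does not address.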