Provable Accuracy Collapse in Embedding-Based Representations under Dimensionality Mismatch

📅 2026-05-05
🤖 AI Summary
This work establishes a fundamental trade-off between embedding dimensionality and representational fidelity from an information-theoretic perspective. It proves that when the embedding dimension is at most a constant fraction \( cD \) of the ground-truth dimension \( D \) (for some constant \( c < 1 \)), every such embedding violates half of the triplet constraints, so its accuracy is no better than that of a trivial one-dimensional solution that ignores the input. Moreover, under the Unique Games Conjecture (UGC), even when the triplets are nearly realizable in a single dimension (\( D = 1 \)), no polynomial-time algorithm can exceed the trivial 50% accuracy in preserving triplet orderings, regardless of the embedding dimension it uses. By combining tools from information theory, contrastive learning theory, and computational complexity, the study reveals an inherent limitation of low-dimensional embeddings and pairs it with a computational hardness result for learning embeddings from triplets.
📝 Abstract
Embedding-based representations in Euclidean space $\mathbb{R}^d$ are a cornerstone of modern machine learning, where a major goal is to use the \emph{smallest dimension} that faithfully captures data relations. In this work, we prove sharp dimension--accuracy tradeoffs and identify a fundamental information-theoretic limitation: unless the embedding dimension $d$ is chosen close to the ground-truth dimension $D$, accuracy undergoes a sudden collapse. Our main result shows that this phenomenon arises even in standard contrastive learning settings, where supervision is limited to a set of $m$ anchor--positive--negative triplets $(i,j,k)$ encoding distance comparisons $\mathrm{dist}(i,j) < \mathrm{dist}(i,k)$. Specifically, given triplets realizable by an unknown ground-truth embedding in $D$ dimensions, we prove that there exists a constant $c < 1$ such that \emph{every embedding of dimension at most $cD$ violates half of the triplets}, yielding accuracy as low as that of a trivial one-dimensional solution that ignores the input. We complement our information-theoretic bounds with strong computational hardness results: under the Unique Games Conjecture, even if the given triplets are nearly realizable in dimension $D=1$, no polynomial-time algorithm -- \textit{regardless of the embedding dimension it uses} -- can achieve accuracy above the trivial $50\%$ baseline.
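To make the accuracy measure in the abstract concrete, the following is a minimal sketch of how triplet accuracy could be computed, and of why an embedding that ignores the input sits near the 50% baseline. All names (`sq_dist`, `triplet_accuracy`, `sample_triplets`) and the sizes `n`, `D`, `m` are illustrative assumptions, not taken from the paper. The triplets here are sampled at random from Gaussian points, which is not the hard instance behind the lower bound, so this sketch illustrates only the metric and the trivial baseline, not the collapse itself.

```python
import random

def sq_dist(x, y):
    """Squared Euclidean distance; preserves the ordering of distances."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def triplet_accuracy(emb, triplets):
    """Fraction of triplets (i, j, k) with dist(i, j) < dist(i, k) under emb."""
    satisfied = sum(
        1 for i, j, k in triplets
        if sq_dist(emb[i], emb[j]) < sq_dist(emb[i], emb[k])
    )
    return satisfied / len(triplets)

def sample_triplets(truth, m, rng):
    """Sample m triplets, each oriented so the ground-truth embedding satisfies it."""
    n = len(truth)
    triplets = []
    while len(triplets) < m:
        i, j, k = rng.sample(range(n), 3)
        d_ij = sq_dist(truth[i], truth[j])
        d_ik = sq_dist(truth[i], truth[k])
        if d_ij == d_ik:
            continue  # skip ties, which encode no comparison
        triplets.append((i, j, k) if d_ij < d_ik else (i, k, j))
    return triplets

rng = random.Random(0)
n, D, m = 60, 8, 2000  # illustrative sizes (assumptions)

# Hypothetical ground-truth embedding in D dimensions.
truth = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(n)]
triplets = sample_triplets(truth, m, rng)

# One-dimensional embedding drawn independently of the data ("ignores the input").
blind = [[rng.gauss(0, 1)] for _ in range(n)]

acc_truth = triplet_accuracy(truth, triplets)  # realizable, so every triplet holds
acc_blind = triplet_accuracy(blind, triplets)  # close to one half by symmetry
print(f"ground truth: {acc_truth:.3f}, input-blind 1-D: {acc_blind:.3f}")
```

Because the input-blind embedding is independent of the triplets, each comparison is satisfied with probability one half by symmetry. The main result stated above says that, on suitable triplet sets, every embedding of dimension at most $cD$ is forced down to this same level.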
Problem

Research questions and friction points this paper is trying to address.

embedding
dimensionality mismatch
accuracy collapse
contrastive learning
information-theoretic limitation
Innovation

Methods, ideas, or system contributions that make the work stand out.

dimensionality mismatch
accuracy collapse
contrastive learning
information-theoretic lower bound
computational hardness