Provable Accuracy Collapse in Embedding-Based Representations under Dimensionality Mismatch

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work establishes a trade-off between embedding dimensionality and representational fidelity from an information-theoretic perspective. It proves that there is a constant $ c < 1 $ such that, for triplets realizable by a ground-truth embedding in $ D $ dimensions, every embedding of dimension at most $ cD $ violates half of the triplet constraints, which is no better than a trivial one-dimensional solution that ignores the input. Moreover, under the Unique Games Conjecture (UGC), even when the given triplets are nearly realizable in one dimension, no polynomial-time algorithm, whatever output dimension it uses, can beat the trivial 50% accuracy in preserving triplet orderings. By combining tools from information theory, contrastive learning theory, and computational complexity, the study identifies an inherent limitation of low-dimensional embeddings and pairs it with a computational hardness result for triplet-based representation learning.
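To make the 50% baseline concrete, triplet accuracy and the random one-dimensional baseline can be written out as below. The symbols $f$, $T$, and $\mathrm{acc}$ are notation assumed here for illustration and are not taken from the paper.

```latex
\[
\mathrm{acc}(f) \;=\; \frac{1}{m}\,
\Bigl|\bigl\{(i,j,k)\in T \;:\; \|f(i)-f(j)\|_2 < \|f(i)-f(k)\|_2\bigr\}\Bigr|,
\qquad f:\{1,\dots,n\}\to\mathbb{R}^d,\quad |T| = m.
\]
% Random placement on a line: draw x_1, ..., x_n i.i.d. from a continuous distribution.
% For distinct i, j, k, swapping x_j and x_k leaves the joint law unchanged, so
\[
\Pr\bigl[\,|x_i - x_j| < |x_i - x_k|\,\bigr] = \tfrac12
\quad\Longrightarrow\quad
\mathbb{E}\bigl[\mathrm{acc}(x)\bigr] = \tfrac12 .
\]
```

Since the expected accuracy is one half, some placement on a line that never looks at the triplets already satisfies at least half of them. This is why 50% is the floor against which the results are measured.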

📝 Abstract

Embedding-based representations in Euclidean space $\mathbb{R}^d$ are a cornerstone of modern machine learning, where a major goal is to use the \emph{smallest dimension} that faithfully captures data relations. In this work, we prove sharp dimension--accuracy tradeoffs and identify a fundamental information-theoretic limitation: unless the embedding dimension $d$ is chosen close to the ground-truth dimension $D$, accuracy undergoes a sudden collapse. Our main result shows that this phenomenon arises even in standard contrastive learning settings, where supervision is limited to a set of $m$ anchor--positive--negative triplets $(i,j,k)$ encoding distance comparisons $\mathrm{dist}(i,j) < \mathrm{dist}(i,k)$. Specifically, given triplets realizable by an unknown ground-truth embedding in $D$ dimensions, we prove that there exists a constant $c < 1$ such that \emph{every embedding of dimension at most $cD$ violates half of the triplets}, yielding accuracy as low as that of a trivial one-dimensional solution that ignores the input. We complement our information-theoretic bounds with strong computational hardness results: under the Unique Games Conjecture, even if the given triplets are nearly realizable in $D=1$ dimension, no polynomial-time algorithm -- \textit{regardless of its dimension} -- can achieve accuracy above the trivial $50\%$ baseline.
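The accuracy measure in the abstract can be illustrated with a minimal NumPy sketch. The helper names `triplet_accuracy` and `sample_realizable_triplets`, the Gaussian ground truth, and the sizes `n`, `D`, `m` are all assumptions chosen for illustration. Random triplets like these are not the hard instances behind the paper's bounds, so the sketch shows only the two reference points: perfect accuracy for the ground truth and roughly 50% for a one-dimensional embedding that ignores the triplets.

```python
import numpy as np

def triplet_accuracy(points, triplets):
    """Fraction of triplets (i, j, k) with dist(i, j) < dist(i, k)."""
    points = np.asarray(points, dtype=float)
    triplets = np.asarray(triplets, dtype=int)
    i, j, k = triplets[:, 0], triplets[:, 1], triplets[:, 2]
    d_pos = np.linalg.norm(points[i] - points[j], axis=1)
    d_neg = np.linalg.norm(points[i] - points[k], axis=1)
    return float(np.mean(d_pos < d_neg))

def sample_realizable_triplets(ground_truth, m, rng):
    """Draw m triplets, each oriented so the ground truth satisfies it."""
    n = ground_truth.shape[0]
    triplets = []
    while len(triplets) < m:
        i, j, k = rng.choice(n, size=3, replace=False)
        d_ij = np.linalg.norm(ground_truth[i] - ground_truth[j])
        d_ik = np.linalg.norm(ground_truth[i] - ground_truth[k])
        if d_ij == d_ik:
            continue  # skip ties so every triplet is a strict comparison
        triplets.append((i, j, k) if d_ij < d_ik else (i, k, j))
    return np.array(triplets)

rng = np.random.default_rng(0)
n, D, m = 200, 32, 5000  # illustrative sizes, not from the paper

ground_truth = rng.standard_normal((n, D))
triplets = sample_realizable_triplets(ground_truth, m, rng)

# The ground-truth embedding satisfies every triplet by construction.
acc_truth = triplet_accuracy(ground_truth, triplets)

# A random placement on a line never looks at the triplets.
random_line = rng.standard_normal((n, 1))
acc_random = triplet_accuracy(random_line, triplets)

print(f"ground truth (d={D}): {acc_truth:.3f}")
print(f"random line  (d=1):  {acc_random:.3f}")
```

Swapping `random_line` for the output of any trained embedding model of dimension `d` would give the quantity that the dimension--accuracy tradeoff constrains.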

Problem

Research questions and friction points this paper is trying to address.

embedding

dimensionality mismatch

accuracy collapse

contrastive learning

information-theoretic limitation

Innovation

Methods, ideas, or system contributions that make the work stand out.

dimensionality mismatch

accuracy collapse

contrastive learning