On the Theoretical Limitations of Embedding-Based Retrieval

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies a fundamental theoretical limitation of the vector-embedding retrieval paradigm: even for simple queries, the embedding dimension strictly bounds the number of distinguishable top-k document subsets — a bottleneck intrinsic to the representation that cannot be resolved by larger models or better training data. Method: drawing on known results in learning theory, the authors derive an upper bound on the expressive capacity of single-vector embeddings, then construct LIMIT, a benchmark explicitly designed to probe this limit; as a best-case empirical test, they also restrict to k=2 and optimize free-parameter embeddings directly on the test set. Results: state-of-the-art embedding models fail on LIMIT despite the simplicity of the task, challenging the prevailing assumption that scaling model size alone overcomes retrieval limitations and establishing a structural capacity ceiling for the single-vector paradigm.

📝 Abstract
Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-following, coding, and more. These new benchmarks push embeddings to work for any query and any notion of relevance that could be given. While prior works have pointed out theoretical limitations of vector embeddings, there is a common assumption that these difficulties are exclusively due to unrealistic queries, and those that are not can be overcome with better training data and larger models. In this work, we demonstrate that we may encounter these theoretical limitations in realistic settings with extremely simple queries. We connect known results in learning theory, showing that the number of top-k subsets of documents capable of being returned as the result of some query is limited by the dimension of the embedding. We empirically show that this holds true even if we restrict to k=2, and directly optimize on the test set with free parameterized embeddings. We then create a realistic dataset called LIMIT that stress tests models based on these theoretical results, and observe that even state-of-the-art models fail on this dataset despite the simple nature of the task. Our work shows the limits of embedding models under the existing single vector paradigm and calls for future research to develop methods that can resolve this fundamental limitation.
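The core claim — that the embedding dimension limits which top-k document subsets any inner-product query can return — can be illustrated with a minimal numpy sketch (not code from the paper; the document layout and parameters here are hypothetical). With d=2, even a dense sweep of query directions over fixed document embeddings reaches only a small fraction of all possible top-2 subsets:

```python
import numpy as np
from itertools import combinations

# Hypothetical setup: n = 8 documents embedded on the unit circle in d = 2.
n, k = 8, 2
angles = 2 * np.pi * np.arange(n) / n
docs = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Sweep many unit-vector queries and record every top-2 subset
# that maximum inner-product retrieval can produce.
reachable = set()
for theta in np.linspace(0.0, 2 * np.pi, 20000, endpoint=False):
    q = np.array([np.cos(theta), np.sin(theta)])
    top2 = tuple(sorted(np.argsort(docs @ q)[-k:]))
    reachable.add(top2)

total = len(list(combinations(range(n), k)))  # C(8, 2) = 28 possible pairs
print(f"reachable top-{k} subsets: {len(reachable)} of {total}")
```

For this geometry only angularly adjacent pairs can ever be a top-2 result, so just 8 of the 28 pairs are reachable; no choice of query recovers the rest. Raising the dimension d (or moving the documents) changes which subsets are reachable, but the paper's point is that for any fixed d the count is bounded regardless of training.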
Problem

Research questions and friction points this paper is trying to address.

Theoretical limitations of embedding-based retrieval arise even in realistic settings with simple queries
The embedding dimension bounds the number of top-k document subsets a single-vector model can return
State-of-the-art embedding models fail on simple retrieval tasks that hit this bound
Innovation

Methods, ideas, or system contributions that make the work stand out.

Connects known results in learning theory to retrieval, bounding realizable top-k subsets by embedding dimension
Best-case empirical test: k=2 with free parameterized embeddings optimized directly on the test set
LIMIT, a realistic stress-test dataset on which even state-of-the-art models fail