🤖 AI Summary
This work investigates the minimal embedding dimension (MED) required to represent all subsets of size at most k for exact Top-k retrieval in embedded spaces. Combining combinatorial analysis, vector space embedding theory, and numerical simulations, the study examines the dimensional requirements for embedding m elements and their subsets of at most k elements under ℓ₂ distance, inner product, and cosine similarity. It rigorously establishes for the first time that ℝ²ᵏ suffices for conflict-free embeddings, demonstrating that the primary bottleneck in Top-k retrieval stems from learnability rather than geometric constraints. Both theoretical analysis and empirical results show that the MED scales logarithmically with the number of elements m, validating the feasibility of representing a subset via the centroid of its constituent element embeddings and establishing a theoretical foundation for exact Top-k retrieval in low-dimensional spaces, such as 2k dimensions, under ideal conditions.
📝 Abstract
This paper studies the minimal dimension required to embed subset memberships ($m$ elements and ${m\choose k}$ subsets of at most $k$ elements) into vector spaces, denoted the Minimal Embeddable Dimension (MED). Tight bounds on the MED are derived theoretically and supported empirically for various notions of "distances" or "similarities," including the $\ell_2$ metric, inner product, and cosine similarity. In addition, we conduct numerical simulations in a more practical setting, where each of the ${m\choose k}$ subset embeddings is chosen as the centroid of the embeddings of its contained elements. Our simulations readily recover a logarithmic dependence of the MED on the number of elements to embed. These findings imply that embedding-based retrieval limitations stem primarily from learnability challenges, not geometric constraints, guiding future algorithm design.
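The centroid construction described above is simple to prototype. The sketch below (an illustration under assumed toy parameters, not the paper's experimental code) embeds $m$ elements as random vectors, represents each $k$-subset by the centroid of its members' embeddings, and checks how often exact Top-k retrieval under the $\ell_2$ metric is conflict-free, i.e. the $k$ nearest element embeddings to a subset's centroid are exactly that subset's members. The values of `m`, `k`, and `d`, and the use of random (unlearned) element embeddings, are illustrative assumptions.

```python
import itertools
import numpy as np

# Toy parameters (assumptions for illustration, not from the paper).
rng = np.random.default_rng(0)
m, k, d = 8, 2, 16                 # elements, subset size, embedding dimension
E = rng.standard_normal((m, d))    # element embeddings (random, not learned)

def topk_exact(subset):
    """True iff the k nearest elements to the subset's centroid (in l2)
    are exactly the subset's members."""
    c = E[list(subset)].mean(axis=0)        # centroid subset embedding
    dists = np.linalg.norm(E - c, axis=1)   # l2 distance to every element
    retrieved = set(np.argsort(dists)[:k])  # indices of the k nearest elements
    return retrieved == set(subset)

# Fraction of all C(m, k) subsets retrieved exactly (conflict-free).
subsets = list(itertools.combinations(range(m), k))
ok = sum(topk_exact(s) for s in subsets)
total = len(subsets)
print(f"{ok}/{total} subsets retrieved exactly")
```

With learned rather than random embeddings, or with a larger dimension $d$, the conflict-free fraction would be expected to rise, which is the regime the paper's MED bounds characterize.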