🤖 AI Summary
This work investigates the learnability of high-dimensional embedding vectors from discrete data, focusing on how sample size, token frequency, and embedding-correlation strength jointly govern estimation accuracy. We propose a low-rank Approximate Message Passing (AMP) algorithm grounded in a probabilistic model that couples token correlation to embedding similarity. This marks the first systematic integration of the AMP framework into the theoretical analysis of embedding estimation, enabling a rigorous characterization of the phase-transition boundary for estimation performance. Leveraging tools from high-dimensional statistical inference and random matrix theory, we derive precise quantitative relationships between embedding estimation error and the key problem parameters. Extensive experiments on synthetic data and real-world text tasks validate our theoretical predictions, demonstrating substantial improvements in statistical efficiency and robustness, particularly in high-dimensional, sparse regimes.
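To make the correlation–similarity coupling concrete, the sketch below shows one hypothetical generative model of this flavor: token pairs co-occur with probability proportional to exp(β⟨z_i, z_j⟩), so tokens with similar embeddings are correlated in the observed data. The exponential form, the parameter `beta`, and the function names are illustrative assumptions, not the paper's specified model.

```python
import numpy as np

def sample_token_pairs(Z, beta, num_samples, seed=0):
    """Sample token pairs (i, j) with P(i, j) proportional to
    exp(beta * <z_i, z_j>): similar embeddings co-occur more often."""
    rng = np.random.default_rng(seed)
    logits = beta * (Z @ Z.T)            # pairwise similarity scores
    p = np.exp(logits - logits.max())    # stabilized exponentiation
    p /= p.sum()                         # normalize to a joint pmf over pairs
    flat = rng.choice(p.size, size=num_samples, p=p.ravel())
    return np.column_stack(np.unravel_index(flat, p.shape))

# Illustrative usage: 50 tokens with "true" 5-dimensional embeddings.
Z = np.random.default_rng(1).standard_normal((50, 5)) / np.sqrt(5)
pairs = sample_token_pairs(Z, beta=1.0, num_samples=10_000)
```

Under such a model, the empirical co-occurrence counts form a noisy low-rank observation of the Gram matrix Z Zᵀ, which is what makes low-rank AMP a natural estimator.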
📝 Abstract
Embeddings are a basic initial feature-extraction step in many machine learning models, particularly in natural language processing. An embedding attempts to map data tokens to a low-dimensional space where similar tokens are mapped to vectors that are close to one another under some metric in the embedding space. A basic question is how well such an embedding can be learned. To study this problem, we consider a simple probability model for discrete data in which there is some "true" but unknown embedding and the correlation of the random variables is related to the similarity of their embeddings. Under this model, it is shown that the embeddings can be learned by a variant of the low-rank approximate message passing (AMP) method. The AMP approach enables precise predictions of the estimation accuracy in certain high-dimensional limits. In particular, the methodology provides insight into how key parameters, such as the number of samples per value, the frequency of the terms, and the strength of the embedding correlation in the probability distribution, affect estimation accuracy. Our theoretical findings are validated by simulations on both synthetic data and real text data.
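For readers unfamiliar with AMP, here is a minimal sketch of the standard rank-one AMP iteration on a synthetic spiked matrix, the textbook form from which low-rank AMP variants derive. The spiked model Y = (λ/n)vvᵀ + W, the tanh denoiser (matched to a ±1 signal prior), and all names are assumptions for illustration; this is not the paper's specific algorithm.

```python
import numpy as np

def amp_rank_one(Y, iters=50, seed=0):
    """Standard rank-one AMP: matrix-vector step, entrywise denoiser,
    and an Onsager correction that debiases the iterates."""
    n = Y.shape[0]
    x = np.random.default_rng(seed).standard_normal(n)  # random init
    f_prev = np.zeros(n)
    for _ in range(iters):
        f = np.tanh(x)                     # posterior-mean denoiser for a +/-1 prior
        b = np.mean(1.0 - f**2)            # Onsager coefficient: mean denoiser slope
        x, f_prev = Y @ f - b * f_prev, f  # corrected power-iteration-like update
    return np.tanh(x)

# Synthetic spiked matrix Y = (lam/n) v v^T + symmetric Gaussian noise.
rng = np.random.default_rng(1)
n, lam = 2000, 2.0
v = rng.choice([-1.0, 1.0], size=n)
G = rng.standard_normal((n, n))
Y = (lam / n) * np.outer(v, v) + (G + G.T) / np.sqrt(2 * n)
v_hat = amp_rank_one(Y)
overlap = abs(v_hat @ v) / (np.linalg.norm(v_hat) * np.linalg.norm(v))
print(f"overlap with planted vector: {overlap:.3f}")  # high when lam exceeds the spectral threshold
```

The Onsager term is what distinguishes AMP from plain power iteration: it cancels the correlation between successive iterates so that, in the high-dimensional limit, each iterate behaves like the signal plus independent Gaussian noise, which is precisely the property that makes the sharp accuracy predictions described in the abstract possible.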