🤖 AI Summary
This work investigates the learnability of high-dimensional embedding vectors from discrete data, focusing on how sample size, token frequency, and embedding-correlation strength jointly govern estimation accuracy. We propose a low-rank Approximate Message Passing (AMP) algorithm grounded in a probabilistic model that couples token correlation to embedding similarity. This marks the first systematic integration of the AMP framework into the theoretical analysis of embedding estimation, enabling a rigorous characterization of the phase-transition boundary for estimation performance. Leveraging tools from high-dimensional statistical inference and random matrix theory, we derive precise quantitative relationships between embedding estimation error and the key problem parameters. Extensive experiments on synthetic data and real-world text tasks validate our theoretical predictions, demonstrating substantial improvements in statistical efficiency and robustness, particularly in high-dimensional, sparse regimes.
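To make the correlation–similarity coupling concrete, the sketch below shows one hypothetical generative model of this flavor: token pairs co-occur with probability proportional to exp(β⟨z_i, z_j⟩), so tokens with similar embeddings are correlated in the observed data. The exponential form, the parameter `beta`, and the function names are illustrative assumptions, not the paper's specified model.

```python
import numpy as np

def sample_token_pairs(Z, beta, num_samples, seed=0):
    """Sample token pairs (i, j) with P(i, j) proportional to
    exp(beta * <z_i, z_j>): similar embeddings co-occur more often."""
    rng = np.random.default_rng(seed)
    logits = beta * (Z @ Z.T)            # pairwise similarity scores
    p = np.exp(logits - logits.max())    # stabilized exponentiation
    p /= p.sum()                         # normalize to a joint pmf over pairs
    flat = rng.choice(p.size, size=num_samples, p=p.ravel())
    return np.column_stack(np.unravel_index(flat, p.shape))

# Illustrative usage: 50 tokens with "true" 5-dimensional embeddings.
Z = np.random.default_rng(1).standard_normal((50, 5)) / np.sqrt(5)
pairs = sample_token_pairs(Z, beta=1.0, num_samples=10_000)
```

Under such a model, the empirical co-occurrence counts form a noisy low-rank observation of the Gram matrix Z Zᵀ, which is what makes low-rank AMP a natural estimator.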
📝 Abstract
Embeddings are a basic initial feature-extraction step in many machine learning models, particularly in natural language processing. An embedding attempts to map data tokens to a low-dimensional space where similar tokens are mapped to vectors that are close to one another under some metric in the embedding space. A basic question is how well such an embedding can be learned. To study this problem, we consider a simple probability model for discrete data in which there is some "true" but unknown embedding and the correlation of the random variables is related to the similarity of their embeddings. Under this model, it is shown that the embeddings can be learned by a variant of the low-rank approximate message passing (AMP) method. The AMP approach enables precise predictions of the estimation accuracy in certain high-dimensional limits. In particular, the methodology provides insight into how key parameters, such as the number of samples per value, the frequency of the terms, and the strength of the embedding correlation in the probability distribution, affect estimation accuracy. Our theoretical findings are validated by simulations on both synthetic data and real text data.
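For readers unfamiliar with AMP, here is a minimal sketch of the standard rank-one AMP iteration on a synthetic spiked matrix, the textbook form from which low-rank AMP variants derive. The spiked model Y = (λ/n)vvᵀ + W, the tanh denoiser (matched to a ±1 signal prior), and all names are assumptions for illustration; this is not the paper's specific algorithm.

```python
import numpy as np

def amp_rank_one(Y, iters=50, seed=0):
    """Standard rank-one AMP: matrix-vector step, entrywise denoiser,
    and an Onsager correction that debiases the iterates."""
    n = Y.shape[0]
    x = np.random.default_rng(seed).standard_normal(n)  # random init
    f_prev = np.zeros(n)
    for _ in range(iters):
        f = np.tanh(x)                     # posterior-mean denoiser for a +/-1 prior
        b = np.mean(1.0 - f**2)            # Onsager coefficient: mean denoiser slope
        x, f_prev = Y @ f - b * f_prev, f  # corrected power-iteration-like update
    return np.tanh(x)

# Synthetic spiked matrix Y = (lam/n) v v^T + symmetric Gaussian noise.
rng = np.random.default_rng(1)
n, lam = 2000, 2.0
v = rng.choice([-1.0, 1.0], size=n)
G = rng.standard_normal((n, n))
Y = (lam / n) * np.outer(v, v) + (G + G.T) / np.sqrt(2 * n)
v_hat = amp_rank_one(Y)
overlap = abs(v_hat @ v) / (np.linalg.norm(v_hat) * np.linalg.norm(v))
print(f"overlap with planted vector: {overlap:.3f}")  # high when lam exceeds the spectral threshold
```

The Onsager term is what distinguishes AMP from plain power iteration: it cancels the correlation between successive iterates so that, in the high-dimensional limit, each iterate behaves like the signal plus independent Gaussian noise, which is precisely the property that makes the sharp accuracy predictions described in the abstract possible.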