On the Emergence of Linear Analogies in Word Embeddings

πŸ“… 2025-05-24
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
The theoretical origin of linear analogical structures (e.g., king βˆ’ man + woman β‰ˆ queen) in word embeddings has remained elusive. Method: We propose the first analytically tractable generative model in which words are defined by binary semantic attributes; by combining spectral analysis of co-occurrence probability matrices with a logarithmic transformation, it formalizes how analogies emerge from latent attribute structure. Contribution/Results: We prove that analogical structure robustly emerges under dimensionality reduction, log-transformation, and data sparsification. The model precisely reproduces the saturation dynamics of analogy performance, achieving close agreement between theoretical predictions and empirical results on Wikipedia corpora and the Mikolov benchmark (mean absolute error < 1.2%). This work establishes the first unified, verifiable, generative explanation for the analogical capability of word vectors.

πŸ“ Abstract
Models such as Word2Vec and GloVe construct word embeddings based on the co-occurrence probability $P(i,j)$ of words $i$ and $j$ in text corpora. The resulting vectors $W_i$ not only group semantically similar words but also exhibit a striking linear analogy structure -- for example, $W_{\text{king}} - W_{\text{man}} + W_{\text{woman}} \approx W_{\text{queen}}$ -- whose theoretical origin remains unclear. Previous observations indicate that this analogy structure: (i) already emerges in the top eigenvectors of the matrix $M(i,j) = P(i,j)/P(i)P(j)$, (ii) strengthens and then saturates as more eigenvectors of $M(i,j)$, which controls the dimension of the embeddings, are included, (iii) is enhanced when using $\log M(i,j)$ rather than $M(i,j)$, and (iv) persists even when all word pairs involved in a specific analogy relation (e.g., king-queen, man-woman) are removed from the corpus. To explain these phenomena, we introduce a theoretical generative model in which words are defined by binary semantic attributes, and co-occurrence probabilities are derived from attribute-based interactions. This model analytically reproduces the emergence of linear analogy structure and naturally accounts for properties (i)-(iv). It can be viewed as giving fine-grained resolution into the role of each additional embedding dimension. It is robust to various forms of noise and agrees well with co-occurrence statistics measured on Wikipedia and the analogy benchmark introduced by Mikolov et al.
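The mechanism the abstract describes can be illustrated with a toy sketch (this is an illustrative assumption, not the authors' code): words are given binary attribute vectors, $\log M(i,j)$ is assumed to take the multiplicative-interaction form $\kappa\, a_i \cdot a_j$, and embeddings are read off from the top eigenvectors, after which the king/queen analogy holds by construction.

```python
import numpy as np

# Toy sketch of the paper's setup (hypothetical parameters, not the authors' model):
# words are defined by binary semantic attributes, here [royal, female].
attrs = {
    "man":   np.array([0.0, 0.0]),
    "woman": np.array([0.0, 1.0]),
    "king":  np.array([1.0, 0.0]),
    "queen": np.array([1.0, 1.0]),
}
words = list(attrs)
A = np.stack([attrs[w] for w in words])      # (4, 2) attribute matrix

# Assume attribute-based interactions of the form
# log M(i,j) = log[P(i,j) / (P(i)P(j))] = kappa * a_i . a_j, with kappa > 0.
kappa = 1.0
logM = kappa * A @ A.T                       # symmetric PSD, rank 2

# Embeddings from the top eigenvectors of log M (dimensionality reduction).
eigval, eigvec = np.linalg.eigh(logM)
top = eigval.argsort()[::-1][:2]             # keep the two leading modes
W = eigvec[:, top] * np.sqrt(eigval[top])    # rows are word vectors

idx = {w: k for k, w in enumerate(words)}
analogy = W[idx["king"]] - W[idx["man"]] + W[idx["woman"]]
print(np.allclose(analogy, W[idx["queen"]]))
```

Because $\log M = \kappa A A^\top$ is exactly rank 2 here, the embedding rows equal the attribute vectors up to an orthogonal rotation, so the linear analogy is exact; real corpora add noise, which the paper's analysis shows the structure survives.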
Problem

Research questions and friction points this paper is trying to address.

Explains linear analogy structure in word embeddings
Analyzes co-occurrence probability's role in embeddings
Proposes generative model for semantic attribute interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses co-occurrence probabilities for word embeddings
Introduces theoretical generative model with attributes
Explains linear analogy structure in embeddings
D. Korchinski
Department of Physics, École Polytechnique Fédérale de Lausanne
Dhruva Karkada
Department of Physics, UC Berkeley
Yasaman Bahri
Research Scientist, Google DeepMind (formerly Brain)
Machine Learning · Condensed Matter Theory · Quantum Materials · Deep Learning Foundations
Matthieu Wyart
Professor of Physics, Johns Hopkins
physics