Semantic Shift: The Fundamental Challenge in Text Embedding and Retrieval

📅 2026-03-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses semantic shift in Transformer-based text embeddings, a phenomenon that leads to embedding collapse and degraded retrieval performance but whose underlying causes remain poorly understood. The study is the first to formalize semantic shift as a quantifiable metric that integrates local semantic evolution and global semantic dispersion. Through theoretical analysis and controlled experiments across multiple corpora, the authors show how this shift induces representation smoothing and diminishes discriminative capacity. Empirically, the proposed metric correlates strongly with embedding concentration and effectively predicts retrieval performance degradation, whereas text length lacks such predictive power. The framework both unifies the explanation of embedding collapse and offers a new perspective for evaluating embedding quality.

📝 Abstract
Transformer-based embedding models rely on pooling to map variable-length text into a single vector, enabling efficient similarity search but also inducing well-known geometric pathologies such as anisotropy and length-induced embedding collapse. Existing accounts largely describe what these pathologies look like, yet provide limited insight into when and why they harm downstream retrieval. In this work, we argue that the missing causal factor is semantic shift: the intrinsic, structured evolution and dispersion of semantics within a text. We first present a theoretical analysis of semantic smoothing in Transformer embeddings: as the semantic diversity among constituent sentences increases, the pooled representation necessarily shifts away from every individual sentence embedding, yielding a smoothed and less discriminative vector. Building on this foundation, we formalize semantic shift as a computable measure integrating local semantic evolution and global semantic dispersion. Through controlled experiments across corpora and multiple embedding models, we show that semantic shift aligns closely with the severity of embedding concentration and predicts retrieval degradation, whereas text length alone does not. Overall, semantic shift offers a unified and actionable lens for understanding embedding collapse and for diagnosing when anisotropy becomes harmful.
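The semantic-smoothing claim in the abstract can be illustrated with a toy geometric example (this is a sketch of the general idea, not the paper's experiments): with mean pooling, two unit "sentence embeddings" separated by a larger angle produce a pooled vector that is both shorter and less similar to each constituent, i.e. less discriminative.

```python
# Toy illustration of semantic smoothing under mean pooling.
# As the angle (semantic diversity) between two unit sentence
# embeddings grows, the pooled vector drifts away from both.
import math

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mean_pool(vectors):
    d = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(d)]

def pair(theta):
    # Two unit vectors in 2-D separated by angle theta.
    return [(1.0, 0.0), (math.cos(theta), math.sin(theta))]

for theta in (0.1, 1.0, 2.0):  # increasing semantic diversity
    vs = pair(theta)
    pooled = mean_pool(vs)
    # Cosine to each constituent and pooled norm both equal cos(theta/2),
    # so both shrink monotonically as diversity grows.
    print(theta, cos_sim(pooled, vs[0]), math.hypot(*pooled))
```

For two unit vectors at angle theta, the pooled vector's norm and its cosine to each constituent are both cos(theta/2), which makes the smoothing effect exact and easy to verify.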
Problem

Research questions and friction points this paper is trying to address.

semantic shift
text embedding
retrieval degradation
embedding collapse
anisotropy
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic shift
embedding collapse
semantic smoothing
anisotropy
text embedding
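The innovation keywords above center on a computable semantic-shift measure combining local semantic evolution with global semantic dispersion. A minimal sketch of such a score follows; the function name, the distance choice, and the unweighted sum are illustrative assumptions, not the paper's actual formula.

```python
# Hypothetical semantic-shift-style score (illustrative only):
# local evolution  = mean distance between consecutive sentence embeddings
# global dispersion = mean distance of each embedding from the centroid
import math

def _dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def semantic_shift(embs):
    n = len(embs)
    if n < 2:
        return 0.0
    # Local semantic evolution along the sentence sequence.
    local = sum(_dist(embs[i], embs[i + 1]) for i in range(n - 1)) / (n - 1)
    # Global semantic dispersion around the centroid.
    d = len(embs[0])
    centroid = [sum(e[i] for e in embs) / n for i in range(d)]
    dispersion = sum(_dist(e, centroid) for e in embs) / n
    return local + dispersion  # equal weighting is a modeling choice

# Toy sentence embeddings: a topically coherent text vs. one that wanders.
coherent = [(1.0, 0.0), (0.99, 0.1), (0.98, 0.15)]
wandering = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
```

On these toy inputs the coherent sequence scores lower than the wandering one, matching the intuition that high-shift texts are the ones prone to smoothed, collapsed pooled embeddings.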