A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens

📅 2024-06-25
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the difficulty of identifying key information efficiently for semantic retrieval over text. We first show that large language model (LLM) text embeddings naturally align with the salient input tokens in latent space, a phenomenon validated empirically across diverse model architectures, training paradigms, and embedding methods. Leveraging this insight, we propose a principal-component-guided alignment enhancement method that decomposes the embedding geometry and explicitly steers representations toward critical tokens. Building on this, we design a lightweight sparse retrieval paradigm that retains only ~20% of token embeddings while achieving 80% of dense retrieval performance. Experiments across eight mainstream LLM embedders confirm the robustness of the alignment mechanism. Our findings provide an interpretable foundation for sparse retrieval and instruction-tuned embeddings, and advance the understanding of the intrinsic nature of semantic relevance.

📝 Abstract
Text embeddings from large language models (LLMs) have achieved excellent results in tasks such as information retrieval and semantic textual similarity. In this work, we present an interesting finding: when a text is fed into an LLM-based embedder, the resulting text embedding can be aligned with the key tokens in the input text. We first analyze this phenomenon thoroughly on eight LLM-based embedders and show that it is universal, unaffected by model architecture, training strategy, or embedding method. With a deeper analysis, we find that the main change in embedding space between these embedders and their LLM backbones lies in the first principal component. By adjusting the first principal component, we can align the text embedding with the key tokens. Finally, we give several examples demonstrating the broad application potential of this finding: (1) we propose a simple and practical sparse retrieval method based on the aligned tokens, which achieves 80% of the dense retrieval performance of the same model while significantly reducing computation; (2) we show that our findings offer a novel perspective for understanding new techniques (e.g., instruction-following embedding) and fuzzy concepts (e.g., semantic relatedness vs. similarity) in this field.
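The abstract's core mechanism can be sketched numerically: score a pooled text embedding against the backbone's token embedding matrix, project out the first principal component of that token space (the direction the paper identifies as the main embedder-vs-backbone shift), and keep the top-scoring tokens as a sparse representation. The sketch below uses random stand-ins for the embedder outputs (`W`, `text_emb`) and a hypothetical 20% retention ratio; it is an illustration of the idea under those assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 64

# Hypothetical stand-ins: in the paper these would come from an LLM embedder
# (the backbone's token embedding matrix and a pooled text embedding).
W = rng.normal(size=(vocab_size, dim))   # token embedding matrix
text_emb = rng.normal(size=dim)          # pooled text embedding

def remove_first_pc(vectors, pc):
    """Project out the given principal component from each row vector."""
    pc = pc / np.linalg.norm(pc)
    return vectors - np.outer(vectors @ pc, pc)

# First principal component of the token embedding space (via SVD of the
# mean-centered matrix; rows of vt are the right singular vectors).
centered = W - W.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
first_pc = vt[0]

# Adjust the first principal component, then score every vocab token
# against the text embedding.
W_adj = remove_first_pc(centered, first_pc)
e_adj = remove_first_pc(text_emb[None, :], first_pc)[0]
scores = W_adj @ e_adj

# Keep only the top ~20% of tokens as a sparse representation.
k = int(0.2 * vocab_size)
top_tokens = np.argsort(scores)[::-1][:k]
sparse_vec = {int(t): float(scores[t]) for t in top_tokens}
```

Two such sparse vectors (for a query and a document) could then be scored by summing the weights of their shared token ids, which is the inverted-index-friendly shape a sparse retriever needs.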
Problem

Research questions and friction points this paper is trying to address.

Information Extraction
Text Mining
Semantic Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Information Retrieval Efficiency
Instruction-following Embeddings
Word Similarity Judgement