Enhancing Lexicon-Based Text Embeddings with Large Language Models

📅 2025-01-16
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing dense embedding methods suffer from information redundancy, limited semantic perspectives, and coarse token-level granularity. This paper proposes LENSβ€”the first dictionary-style text embedding framework powered by large language models (LLMs). LENS constructs an interpretable semantic dictionary via token embedding clustering, where each embedding dimension explicitly corresponds to a distinct semantic cluster. To mitigate the unidirectional modeling constraint of causal LLMs, it introduces bidirectional attention and employs multi-strategy pooling for fine-grained semantic aggregation. LENS achieves the first LLM-driven dictionary-level embedding compression and distillation, outperforming mainstream dense models on the MTEB benchmark with comparable parameter counts. Moreover, when fused with conventional dense embeddings, LENS establishes new state-of-the-art performance on the BEIR retrieval benchmark.
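The dictionary construction described above — clustering token embeddings so that each embedding dimension corresponds to one semantic cluster — can be sketched roughly as follows. The vocabulary size, cluster count, random stand-in embeddings, and max-pooling rule are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: build a "semantic dictionary" by clustering token embeddings,
# then collapse per-token weights into one value per cluster.
# All sizes and the pooling rule are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vocab_size, dim, n_clusters = 1000, 64, 32

# Stand-in for an LLM's input token embedding matrix.
token_embeddings = rng.standard_normal((vocab_size, dim))

# Group semantically similar tokens; each cluster id becomes one
# dimension of the lexicon embedding.
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
token_to_cluster = kmeans.fit_predict(token_embeddings)

def lexicon_embedding(token_weights: np.ndarray) -> np.ndarray:
    """Collapse per-token weights (e.g. output logits) into one weight
    per semantic cluster, yielding a compact lexicon embedding."""
    emb = np.zeros(n_clusters)
    # Unbuffered max-pool: emb[c] = max over tokens assigned to cluster c.
    np.maximum.at(emb, token_to_cluster, token_weights)
    return emb

doc_logits = rng.random(vocab_size)  # stand-in token relevance scores
emb = lexicon_embedding(doc_logits)
print(emb.shape)  # (32,)
```

The payoff of the clustering step is dimensionality: the embedding has one slot per cluster rather than one per vocabulary entry, which is how a lexicon-style representation can stay as compact as a dense one.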


📝 Abstract
Recent large language models (LLMs) have demonstrated exceptional performance on general-purpose text embedding tasks. While dense embeddings have dominated related research, we introduce the first Lexicon-based EmbeddiNgS (LENS) leveraging LLMs that achieve competitive performance on these tasks. To address the inherent tokenization redundancy and the unidirectional attention limitation of traditional causal LLMs, LENS consolidates the vocabulary space through token embedding clustering and investigates bidirectional attention and various pooling strategies. Specifically, LENS simplifies lexicon matching by assigning each dimension to a specific token cluster, where semantically similar tokens are grouped together, and unlocks the full potential of LLMs through bidirectional attention. Extensive experiments demonstrate that LENS outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB), delivering compact feature representations whose sizes match those of dense counterparts. Notably, combining LENS with dense embeddings achieves state-of-the-art performance on the retrieval subset of MTEB (i.e., BEIR).
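The final claim, combining lexicon-based and dense embeddings for retrieval, could be sketched as a simple score-level fusion. The weighted-sum rule and the `alpha` weight below are assumptions for illustration; the abstract does not specify the combination mechanism.

```python
# Sketch: fuse lexicon-based and dense similarity scores for retrieval.
# The weighted-sum rule and alpha=0.5 are illustrative assumptions,
# not the paper's exact recipe.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def fused_score(q_lex, d_lex, q_dense, d_dense, alpha: float = 0.5) -> float:
    """Convex combination of lexicon and dense cosine similarities."""
    return alpha * cosine(q_lex, d_lex) + (1 - alpha) * cosine(q_dense, d_dense)

rng = np.random.default_rng(1)
q_lex, d_lex = rng.random(32), rng.random(32)          # lexicon embeddings
q_dense, d_dense = rng.standard_normal(64), rng.standard_normal(64)
score = fused_score(q_lex, d_lex, q_dense, d_dense)
print(round(score, 3))
```

Because both components are cosine similarities, the fused score stays in [-1, 1] and documents can be ranked by it directly.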
Problem

Research questions and friction points this paper is trying to address.

- Dense Embedding Methods
- Information Wastage
- Limited Perspective
Innovation

Methods, ideas, or system contributions that make the work stand out.

- LENS
- Bidirectional Perspective
- Resource-efficient Embeddings
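The "Bidirectional Perspective" point refers to lifting the causal-attention constraint of decoder-only LLMs. A minimal numpy sketch of how the two attention modes differ only in the mask (the sequence length, dimensions, and inputs are illustrative):

```python
# Sketch: causal vs. bidirectional self-attention differ only in whether
# future positions are masked out. Inputs and sizes are illustrative.
import numpy as np

def attention(q, k, v, causal: bool):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Block attention to future positions (upper triangle).
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(2)
x = rng.standard_normal((4, 8))          # 4 tokens, 8-dim states
causal_out = attention(x, x, x, causal=True)
bidir_out = attention(x, x, x, causal=False)

# Under the causal mask the first token can only attend to itself, so its
# output equals its own value vector; bidirectionally it sees later tokens.
print(np.allclose(causal_out[0], bidir_out[0]))  # False
```

Removing the mask lets every token's representation aggregate context from both directions, which is the property LENS reportedly exploits for embedding quality.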