Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models

📅 2025-07-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Decoder-only large language models (LLMs) struggle with general-purpose text embedding because causal attention restricts each token's context to leftward tokens only. Method: We propose Causal2Vec, a lightweight framework that requires no architectural modifications, bidirectional attention, or additional prompting. It employs a lightweight pretrained BERT-style encoder to pre-encode the input text into a single Contextual token, prepends it to the LLM's input sequence, and concatenates the final-layer hidden states of the Contextual and EOS tokens to mitigate recency bias and better activate pretrained knowledge. Contribution/Results: Causal2Vec achieves state-of-the-art performance on the MTEB benchmark among models trained solely on publicly available retrieval data, while reducing the required sequence length by up to 85% and inference time by up to 82%, improving both efficiency and cross-task generalization.

📝 Abstract
Decoder-only large language models (LLMs) are increasingly used to build embedding models that effectively encode the semantic information of natural language texts into dense vector representations for various embedding tasks. However, many existing methods primarily focus on removing the causal attention mask in LLMs to enable bidirectional attention, potentially undermining the model's ability to extract semantic information acquired during pretraining. Additionally, leading unidirectional approaches often rely on extra input text to overcome the inherent limitations of causal attention, inevitably increasing computational costs. In this work, we propose Causal2Vec, a general-purpose embedding model tailored to enhance the performance of decoder-only LLMs without altering their original architectures or introducing significant computational overhead. Specifically, we first employ a lightweight BERT-style model to pre-encode the input text into a single Contextual token, which is then prepended to the LLM's input sequence, allowing each token to capture contextualized information even without attending to future tokens. Furthermore, to mitigate the recency bias introduced by last-token pooling and help LLMs better leverage the semantic information encoded in the Contextual token, we concatenate the last hidden states of Contextual and EOS tokens as the final text embedding. In practice, Causal2Vec achieves state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB) among models trained solely on publicly available retrieval datasets, while reducing the required sequence length by up to 85% and inference time by up to 82% compared to best-performing methods.
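The pooling step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the LLM's input sequence was built as `[Contextual token] + input tokens + [EOS]`, and that we already have the decoder's last-layer hidden states as an array; the function name is hypothetical.

```python
import numpy as np

def causal2vec_embed(last_hidden_states: np.ndarray) -> np.ndarray:
    """Pool a Causal2Vec-style text embedding from decoder hidden states.

    last_hidden_states: (seq_len, hidden_dim) last-layer states for a
    sequence laid out as [Contextual token] + input tokens + [EOS].

    Returns a (2 * hidden_dim,) vector: the Contextual token's state
    concatenated with the EOS token's state. Including the Contextual
    token mitigates the recency bias of plain last-token (EOS) pooling.
    """
    contextual = last_hidden_states[0]   # prepended Contextual token
    eos = last_hidden_states[-1]         # sentence-final EOS token
    return np.concatenate([contextual, eos])

# toy example: 5 tokens, hidden size 4
states = np.arange(20, dtype=np.float32).reshape(5, 4)
emb = causal2vec_embed(states)
assert emb.shape == (8,)
```

Because the Contextual token sits at position 0, every later token can attend to it under the causal mask, which is how each token gains contextualized information without bidirectional attention.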
Problem

Research questions and friction points this paper is trying to address.

Enhancing decoder-only LLMs for embedding tasks without architecture changes
Reducing computational costs in unidirectional embedding approaches
Mitigating the recency bias of last-token pooling for better semantic capture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight BERT-style pre-encoding of the input into a single Contextual token
Concatenated Contextual and EOS hidden states to reduce recency bias
Maintains original LLM architecture with low overhead
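The first bullet, prepending the pre-encoded Contextual token, can be sketched as follows. This is an assumption-laden illustration: the projection matrix `proj` (mapping the BERT-style encoder's sentence vector into the LLM's embedding space) and the function name are hypothetical, and the pre-encoder itself is stubbed out as a given vector.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_llm_inputs(token_embeds: np.ndarray,
                     cls_state: np.ndarray,
                     proj: np.ndarray) -> np.ndarray:
    """Prepend a projected Contextual token to the LLM input embeddings.

    token_embeds: (seq_len, llm_dim) embeddings of the input tokens.
    cls_state:    (bert_dim,) sentence vector from the lightweight
                  BERT-style pre-encoder (assumed given here).
    proj:         (bert_dim, llm_dim) learned projection into the LLM's
                  embedding space (hypothetical parameter).
    """
    contextual = cls_state @ proj                    # map to LLM space
    return np.vstack([contextual[None, :], token_embeds])

llm_dim, bert_dim, seq_len = 8, 6, 5
inputs = build_llm_inputs(rng.standard_normal((seq_len, llm_dim)),
                          rng.standard_normal(bert_dim),
                          rng.standard_normal((bert_dim, llm_dim)))
assert inputs.shape == (seq_len + 1, llm_dim)
```

Because only one token is prepended and the decoder itself is untouched, the original LLM architecture and weights are preserved, which is where the low overhead claim comes from.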