🤖 AI Summary
Large language models (LLMs) used as text encoders in contrastive learning suffer from an inherent mismatch: their autoregressive training makes embeddings predict next-token conditional distributions, while contrastive learning requires embeddings that capture global semantics and align via cosine similarity. This conflict leaves pretraining knowledge underused and makes learning inefficient. Method: We propose AutoRegEmbed—the first framework to explicitly incorporate the LLM's conditional probability distribution into contrastive learning. It jointly optimizes three objectives: (i) compressing text information into the embedding space, (ii) aligning the conditional distributions induced by positive-pair embeddings, and (iii) suppressing the probability of generating negative samples from the anchor embedding—thereby unifying semantic representation learning with autoregressive modeling. Contribution/Results: AutoRegEmbed introduces a conditional-distribution contrastive loss, an embedding compression module, and a negative-suppression mechanism. On multi-task retrieval benchmarks it significantly outperforms baselines including SimCSE and CoSENT, and with the same amount of training data it matches state-of-the-art performance, empirically validating the value of explicitly modeling autoregressive structure in contrastive text embedding.
📝 Abstract
A recent trend uses LLMs as dense text encoders via contrastive learning. However, LLM embeddings are trained to predict the probability distribution of the next token, making them inherently generative and distributional; this conflicts with contrastive learning, which requires embeddings to capture full-text semantics and align via cosine similarity. The discrepancy prevents full use of LLMs' pre-training capabilities and results in inefficient learning. To address this, we propose AutoRegEmbed, a new contrastive learning method built on conditional probability distributions of embeddings, which integrates two core tasks: information compression and conditional distribution alignment. The information compression task encodes text into the embedding space, ensuring that the embedding vectors capture global semantics. The conditional distribution alignment task aligns text embeddings with positive-sample embeddings through the conditional distributions they induce, while simultaneously reducing the likelihood of generating negative samples from the text embedding, thereby achieving both embedding alignment and uniformity. Experimental results demonstrate that our method significantly outperforms traditional contrastive learning approaches and achieves performance comparable to state-of-the-art models when using the same amount of data.
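The abstract does not give the exact loss, but the alignment-plus-suppression idea can be illustrated with a toy sketch: treat each embedding as inducing a next-token distribution (here, a softmax over mock logits), pull the anchor's distribution toward the positive's via KL divergence, and push it away from each negative's with a hinge so that already-distant negatives contribute nothing. All function names, the hinge form, and the `margin` parameter are illustrative assumptions, not the paper's actual formulation.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def conditional_alignment_loss(anchor_logits, positive_logits,
                               negative_logits_list, margin=1.0):
    """Toy conditional-distribution contrastive loss (illustrative only):
    align the anchor-conditioned distribution with the positive's, and
    suppress closeness to each negative's distribution via a KL hinge."""
    p_anchor = softmax(anchor_logits)
    # Alignment term: KL between anchor- and positive-conditioned distributions.
    align = kl_divergence(p_anchor, softmax(positive_logits))
    # Suppression term: penalize negatives whose distribution is within `margin`.
    push = 0.0
    for neg_logits in negative_logits_list:
        push += max(0.0, margin - kl_divergence(p_anchor, softmax(neg_logits)))
    return align + push / max(1, len(negative_logits_list))

# An identical positive and a distant negative yield (near-)zero loss;
# a mismatched positive with a nearby negative is penalized on both terms.
good = conditional_alignment_loss([1, 2, 3], [1, 2, 3], [[3, 2, 1]])
bad = conditional_alignment_loss([1, 2, 3], [3, 2, 1], [[1, 2, 3]])
```

In the paper the distributions would come from the LLM's decoder conditioned on learned compressed embeddings rather than from raw logit vectors; the sketch only shows how alignment and negative suppression can coexist in one objective.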