Making Large Language Models A Better Foundation For Dense Retrieval

📅 2023-12-24
🏛️ arXiv.org
📈 Citations: 73
Influential: 6
📄 PDF
🤖 AI Summary
Large language models (LLMs) are suboptimal semantic encoders for dense retrieval because their autoregressive pretraining objective conflicts with the discriminative embedding representations that retrieval requires. To address this, the authors propose LLaRA, a lightweight, post-hoc adaptation framework. Its core innovation lies in two auxiliary pretext tasks—Embedding-Based Auto-Encoding (EBAE) and Embedding-Based Auto-Regression (EBAR)—applied atop LLaMA-2-7B using Wikipedia data to align its embedding space with retrieval objectives. Crucially, LLaRA requires no architectural modification or full-parameter fine-tuning. Evaluated on standard benchmarks including MSMARCO and BEIR, the adapted model achieves state-of-the-art or near-state-of-the-art performance as a retriever encoder. The model and code are publicly released in the BGE repository.
📝 Abstract
Dense retrieval needs to learn discriminative text embeddings to represent the semantic relationship between query and document. It may benefit from the use of large language models (LLMs), given their strong capability in semantic understanding. However, LLMs are pre-trained on text generation tasks, whose working pattern is completely different from representing texts as embeddings. As a result, it is imperative to study how to adapt LLMs properly so that they can be effectively initialized as the backbone encoder for dense retrieval. In this paper, we propose a novel approach, called LLaRA (LLM adapted for dense RetrievAl), which works as a post-hoc adaptation of the LLM for the dense retrieval application. LLaRA consists of two pretext tasks: EBAE (Embedding-Based Auto-Encoding) and EBAR (Embedding-Based Auto-Regression), where the text embeddings from the LLM are used to reconstruct the tokens of the input sentence and to predict the tokens of the next sentence, respectively. LLaRA turns out to be simple, lightweight, and highly effective. It is applied to adapt LLaMA-2-7B (base) on the Wikipedia corpus, where it substantially improves the model's fine-tuned performance on a variety of dense retrieval benchmarks, such as MSMARCO and BEIR. Our model and code will be made publicly available in the BGE repository.
Problem

Research questions and friction points this paper is trying to address.

Adapting auto-regressive LLMs for discriminative dense retrieval tasks
Bridging the gap between LLM semantics and text embedding representation
Enabling effective LLM initialization as backbone encoders for retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised adaptation of LLMs for dense retrieval
Uses Embedding-Based Auto-Encoding reconstruction task
Uses Embedding-Based Auto-Regression prediction task
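The two pretext tasks above share one mechanism: a single sentence embedding must carry enough information to predict tokens, either the input sentence's own tokens (EBAE) or the next sentence's tokens (EBAR). The following is a minimal illustrative sketch in NumPy, not the authors' implementation: the toy mean-pooled `embed` function, the random token table and LM head, and the simplified per-token cross-entropy are all assumptions for illustration (in the paper, the embedding is the hidden state of a special prompt token in LLaMA-2-7B, decoded through the model's own LM head).

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 1000, 64

# Stand-ins for a real LLM (illustrative assumptions, not the paper's code):
token_table = rng.normal(0.0, 1.0, (VOCAB, DIM))  # token embedding table
lm_head = rng.normal(0.0, 0.02, (DIM, VOCAB))     # projects embedding -> vocab logits

def embed(token_ids):
    """Toy sentence embedding: mean-pool the token vectors.
    In LLaRA this role is played by the LLM's hidden state at a prompt token."""
    return token_table[token_ids].mean(axis=0)

def token_prediction_loss(sentence_emb, target_ids):
    """Cross-entropy of predicting each target token directly from the single
    sentence embedding -- the shared form of the EBAE and EBAR objectives."""
    logits = sentence_emb @ lm_head
    logits = logits - logits.max()                       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())    # log-softmax
    return -log_probs[target_ids].mean()

# Toy token ids for two consecutive sentences.
input_ids = [5, 42, 7, 99]   # tokens of the input sentence
next_ids = [13, 8, 250]      # tokens of the following sentence

emb = embed(input_ids)
ebae_loss = token_prediction_loss(emb, input_ids)  # reconstruct the input sentence
ebar_loss = token_prediction_loss(emb, next_ids)   # predict the next sentence
total_loss = ebae_loss + ebar_loss
print(f"EBAE: {ebae_loss:.3f}  EBAR: {ebar_loss:.3f}")
```

Because both losses are driven by the same embedding, minimizing them forces that one vector to summarize both the sentence's content and its forward context, which is what aligns the representation with retrieval-style semantic matching.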