Revela: Dense Retriever Learning via Language Modeling

📅 2025-06-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity and high annotation cost of labeled query-document pairs in specialized domains (e.g., code), this paper proposes Revela, a framework that reframes dense retriever training as a self-supervised language modeling task. It jointly optimizes the retriever and the language model via next-token prediction conditioned on both local and cross-document context. The core innovation is an in-batch cross-document attention mechanism weighted by retriever-computed similarity scores, which lets the retriever be optimized end-to-end as part of the language modeling objective. The framework is evaluated across various retriever backbones. On the BEIR and CoIR benchmarks, Revela achieves absolute gains of 5.2% (18.3% relative) and 5.6% (14.4% relative) in NDCG@10 over the previous best method at a comparable parameter scale. Performance also improves consistently with model size, demonstrating both effectiveness and scalability.

📝 Abstract
Dense retrievers play a vital role in accessing external and specialized knowledge to augment language models (LMs). Training dense retrievers typically requires annotated query-document pairs, which are costly and hard to obtain in specialized domains such as code, motivating growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next-token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of language modeling to train retrievers? To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised retriever learning via language modeling. Revela models semantic dependencies among documents by conditioning next-token prediction on both local and cross-document context through an in-batch attention mechanism. This attention is weighted by retriever-computed similarity scores, enabling the retriever to be optimized as part of language modeling. We evaluate Revela on both general-domain (BEIR) and domain-specific (CoIR) benchmarks across various retriever backbones. At a comparable parameter scale, Revela outperforms the previous best method with absolute improvements of 5.2% (18.3% relative) and 5.6% (14.4% relative) on NDCG@10, respectively, underscoring its effectiveness. Performance increases with model size, highlighting both the scalability of our approach and its promise for self-supervised retriever learning.
Problem

Research questions and friction points this paper is trying to address.

Self-supervised learning for dense retriever training
Reducing reliance on annotated query-document pairs
Adapting language modeling objectives to retrieval tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised retriever learning via language modeling
In-batch attention mechanism for document dependencies
Retriever optimization through similarity-weighted attention
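The core mechanism named above can be illustrated with a minimal numpy sketch: the retriever scores every pair of in-batch documents, the scores are softmax-normalized into cross-document attention weights, and each document's local context is blended with a weighted sum of the others before next-token prediction. All function names here are hypothetical, and numpy stands in for the differentiable implementation the paper would actually require; in Revela proper, gradients through the weights are what train the retriever.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def in_batch_cross_doc_weights(doc_embeddings):
    """Retriever-computed similarity scores between every pair of in-batch
    documents, normalized with softmax. The diagonal (a document attending
    to itself) is masked so the weights capture cross-document dependencies
    only."""
    sims = doc_embeddings @ doc_embeddings.T      # (B, B) dot-product similarity
    np.fill_diagonal(sims, -np.inf)               # mask self-similarity
    return softmax(sims, axis=-1)                 # each row sums to 1

def blend_contexts(local_hidden, weights):
    """Mix each document's local hidden state with a similarity-weighted sum
    of the other in-batch documents' hidden states; the result would condition
    the LM's next-token prediction."""
    cross = weights @ local_hidden                # (B, D) cross-document context
    return local_hidden + cross
```

Because the attention weights come from the retriever's embeddings, lowering the LM loss pushes the retriever to assign high similarity to documents that genuinely help predict each other's tokens, which is the self-supervised signal replacing annotated query-document pairs.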