Learning More Effective Representations for Dense Retrieval through Deliberate Thinking Before Search

📅 2025-02-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the limitation of single-embedding document representations in dense retrieval—namely, their inability to capture multi-faceted semantic aspects—this paper proposes DEBATER, a novel framework featuring Chain-of-Deliberation: a first-of-its-kind mechanism that iteratively refines document representations through multi-step reasoning. DEBATER further incorporates Self-Distillation to consolidate critical reasoning steps into a unified, robust, and information-rich strong embedding. By deeply integrating large language model capabilities, chain-of-thought reasoning, and self-distillation, DEBATER achieves significant improvements over state-of-the-art dense retrievers (e.g., ColBERT, ANCE) on major benchmarks including MSMARCO and BEIR, notably enhancing Recall@10 and MRR. Moreover, it demonstrates superior generalization and robustness across diverse domains and query distributions. The implementation is publicly available.

Technology Category

Application Category

📝 Abstract
Recent dense retrievers usually thrive on the emergency capabilities of Large Language Models (LLMs), using them to encode queries and documents into an embedding space for retrieval. These LLM-based dense retrievers have shown promising performance across various retrieval scenarios. However, relying on a single embedding to represent documents proves less effective in capturing different perspectives of documents for matching. In this paper, we propose Deliberate Thinking based Dense Retriever (DEBATER), which enhances these LLM-based retrievers by enabling them to learn more effective document representations through a step-by-step thinking process. DEBATER introduces the Chain-of-Deliberation mechanism to iteratively optimize document representations using a continuous chain of thought. To consolidate information from various thinking steps, DEBATER also incorporates the Self Distillation mechanism, which identifies the most informative thinking steps and integrates them into a unified text embedding. Experimental results show that DEBATER significantly outperforms existing methods across several retrieval benchmarks, demonstrating superior accuracy and robustness. All codes are available at https://github.com/OpenBMB/DEBATER.
Problem

Research questions and friction points this paper is trying to address.

Enhancing dense retrieval effectiveness
Optimizing document representations iteratively
Integrating informative thinking steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Deliberation mechanism optimizes representations
Self Distillation integrates informative thinking steps
DEBATER enhances LLM-based retrievers effectively
🔎 Similar Papers
No similar papers found.
Y
Yifan Ji
Department of Computer Science and Technology, Northeastern University, China
Zhipeng Xu
Zhipeng Xu
Northeastern University
NLPInformation Retrieval
Zhenghao Liu
Zhenghao Liu
Northeastern University
NLPInformation Retrieval
Yukun Yan
Yukun Yan
Tsinghua University
Large Language Model
Shi Yu
Shi Yu
Tsinghua University
LLMRAGInformation RetrievalNatural Language Processing
Yishan Li
Yishan Li
OpenBMB
Natural Language ProcessingLagre Language ModelInformation Retrieval
Z
Zhiyuan Liu
Department of Computer Science and Technology, Institute for AI, Tsinghua University, China, Beijing National Research Center for Information Science and Technology, China
Y
Yu Gu
Department of Computer Science and Technology, Northeastern University, China
G
Ge Yu
Department of Computer Science and Technology, Northeastern University, China
Maosong Sun
Maosong Sun
Professor of Computer Science and Technology, Tsinghua University
Natural Language ProcessingArtificial IntelligenceSocial Computing