🤖 AI Summary
Conventional RAG relies on relevance-based retrieval, which often returns redundant passages and fragments cross-source reasoning in multi-hop question answering. To address this, the paper proposes Vendi-RAG, an iterative retrieval-augmented generation (RAG) framework. It applies the Vendi Score, a flexible similarity-based diversity metric, to quantify the semantic diversity of retrieved passages (its first use in a RAG pipeline) and couples it with an LLM-based judge that scores candidate answers, letting the retriever dynamically balance diversity against answer quality in a closed retrieval-generation loop. Evaluated on three benchmarks including HotpotQA, the method significantly improves multi-hop reasoning accuracy, achieving up to a 4.2% absolute gain over Adaptive-RAG, with consistent improvements across GPT-3.5, GPT-4, and GPT-4o-mini backbones. Key contributions: (1) a diversity metric that guides retrieval optimization; (2) an LLM-guided dynamic trade-off mechanism between diversity and answer fidelity; and (3) empirical validation of iterative, diversity-aware RAG on complex multi-hop reasoning tasks.
📝 Abstract
Retrieval-augmented generation (RAG) enhances large language models (LLMs) for domain-specific question-answering (QA) tasks by leveraging external knowledge sources. However, traditional RAG systems primarily focus on relevance-based retrieval and often struggle with redundancy, especially when reasoning requires connecting information from multiple sources. This paper introduces Vendi-RAG, a framework based on an iterative process that jointly optimizes retrieval diversity and answer quality. This joint optimization leads to significantly higher accuracy for multi-hop QA tasks. Vendi-RAG leverages the Vendi Score (VS), a flexible similarity-based diversity metric, to promote semantic diversity in document retrieval. It then uses an LLM judge that evaluates candidate answers, generated after a reasoning step, and outputs a score that the retriever uses to balance relevance and diversity among the retrieved documents during each iteration. Experiments on three challenging datasets -- HotpotQA, MuSiQue, and 2WikiMultiHopQA -- demonstrate Vendi-RAG's effectiveness in multi-hop reasoning tasks. The framework achieves significant accuracy improvements over traditional single-step and multi-step RAG approaches, with accuracy increases reaching up to +4.2% on HotpotQA, +4.1% on 2WikiMultiHopQA, and +1.3% on MuSiQue compared to Adaptive-RAG, the current best baseline. The benefits of Vendi-RAG are even more pronounced as the number of retrieved documents increases. Finally, we evaluated Vendi-RAG across different LLM backbones, including GPT-3.5, GPT-4, and GPT-4o-mini, and observed consistent improvements, demonstrating that the framework's advantages are model-agnostic.
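To make the diversity metric concrete: the Vendi Score of a set of n items is defined as the exponential of the Shannon entropy of the eigenvalues of K/n, where K is a positive semi-definite similarity kernel with unit diagonal. It ranges from 1 (all items identical) to n (all items mutually dissimilar). The sketch below computes it over passage embeddings using a cosine-similarity kernel; the kernel choice is an illustrative assumption, not necessarily the one used in the paper.

```python
import numpy as np

def vendi_score(embeddings: np.ndarray) -> float:
    """Vendi Score of n embedding vectors (shape n x d).

    exp(Shannon entropy of the eigenvalues of K/n), where K is a
    similarity kernel with K_ii = 1. Cosine similarity is used here
    as an illustrative kernel choice.
    """
    # Row-normalize so the dot product is cosine similarity
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    K = X @ X.T  # similarity kernel, unit diagonal
    eigvals = np.linalg.eigvalsh(K / K.shape[0])
    eigvals = eigvals[eigvals > 1e-12]  # drop numerical zeros
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))
```

As a sanity check, three mutually orthogonal embeddings give a score of 3 (maximal diversity), while any number of identical embeddings give a score of 1. In an iterative RAG loop, this score can be combined with relevance and the judge's answer-quality signal when re-ranking candidate passages.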