From Retrieval to Generation: Comparing Different Approaches

📅 2025-02-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study systematically investigates the trade-off between retrieval accuracy and generative flexibility in knowledge-intensive tasks, namely open-domain question answering (ODQA), document reranking, and retrieval-augmented language modeling. Within a unified evaluation framework, it compares representative retrieval-based (BM25, DPR), generative (GPT-4o), and hybrid (e.g., RAG) models on three benchmarks, Natural Questions (NQ), BEIR, and WikiText-103, using standard metrics: top-1 accuracy, nDCG@10, and perplexity. DPR achieves 50.17% top-1 accuracy on NQ; hybrid models raise nDCG@10 on BEIR from 43.42 (BM25) to 52.59; and BM25 yields lower perplexity on WikiText-103 than generative and hybrid methods. From these findings, the study derives practical guidance on which approach to choose given a task's semantic requirements and factual-consistency constraints, informing the design of robust RAG systems.
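
The two headline retrieval metrics can be made concrete. The sketch below is a minimal, assumed implementation: nDCG@10 uses the common linear-gain form with a log2 position discount, and top-1 accuracy follows the DPR-style answer-containment convention (a question counts as a hit if the top-ranked passage contains any gold answer string). The summary does not state which exact variants or answer normalization the study used, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """DCG over the top-k positions with linear gain: sum of rel_i / log2(i + 1), i = 1..k."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: List[str], qrels: Dict[str, float], k: int = 10) -> float:
    """nDCG@k for one query: DCG of the system ranking divided by DCG of the ideal ranking.

    qrels maps judged document ids to graded relevance; unjudged ids count as 0.
    """
    gains = [qrels.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(qrels.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def top1_accuracy(top_passages: List[str], gold_answers: List[List[str]]) -> float:
    """Share of questions whose top-ranked passage contains any gold answer string.

    Assumption: DPR-style answer containment with simple lowercasing only.
    """
    hits = sum(
        any(ans.lower() in passage.lower() for ans in answers)
        for passage, answers in zip(top_passages, gold_answers)
    )
    return hits / len(top_passages) if top_passages else 0.0
```

Corpus-level scores are then the mean of `ndcg_at_k` over all queries, and `top1_accuracy` is already a mean over questions.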

📝 Abstract

Knowledge-intensive tasks, particularly open-domain question answering (ODQA), document reranking, and retrieval-augmented language modeling, require a balance between retrieval accuracy and generative flexibility. Traditional retrieval models such as BM25 and Dense Passage Retrieval (DPR) efficiently retrieve from large corpora but often lack semantic depth. Generative models like GPT-4o provide richer contextual understanding but face challenges in maintaining factual consistency. In this work, we conduct a systematic evaluation of retrieval-based, generation-based, and hybrid models, with a primary focus on their performance in ODQA and related retrieval-augmented tasks. Our results show that dense retrievers, particularly DPR, achieve strong performance in ODQA with a top-1 accuracy of 50.17% on NQ, while hybrid models improve nDCG@10 scores on BEIR from 43.42 (BM25) to 52.59, demonstrating their strength in document reranking. Additionally, we analyze language modeling on WikiText-103, showing that retrieval-based approaches like BM25 achieve lower perplexity than generative and hybrid methods, highlighting their utility in retrieval-augmented generation. By providing detailed comparisons and practical insights into the conditions where each approach excels, we aim to facilitate future optimizations in retrieval, reranking, and generative models for ODQA and related knowledge-intensive applications.
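
For readers unfamiliar with the sparse baseline, BM25 ranks documents by summing, over query terms, an IDF weight times a saturated, length-normalized term frequency. The sketch below is a minimal illustration, not the study's implementation: it assumes pre-tokenized text, the Lucene-style IDF variant, and the commonly used defaults k1 = 1.2 and b = 0.75; the `BM25` class and its methods are hypothetical. DPR, by contrast, scores passages by the inner product of learned question and passage embeddings, which is not shown here.

```python
import math
from collections import Counter
from typing import List


class BM25:
    """Minimal Okapi BM25 over pre-tokenized documents (Lucene-style IDF)."""

    def __init__(self, corpus: List[List[str]], k1: float = 1.2, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_tfs = [Counter(doc) for doc in corpus]
        self.doc_lens = [len(doc) for doc in corpus]
        self.avgdl = sum(self.doc_lens) / len(corpus)
        n_docs = len(corpus)
        # Document frequency: number of documents containing each term.
        df = Counter(term for doc in corpus for term in set(doc))
        # Lucene-style IDF stays positive even for terms in most documents.
        self.idf = {t: math.log(1 + (n_docs - n + 0.5) / (n + 0.5)) for t, n in df.items()}

    def score(self, query: List[str], idx: int) -> float:
        """BM25 score of document idx for the query tokens."""
        tf, dl = self.doc_tfs[idx], self.doc_lens[idx]
        # Length normalization: longer-than-average documents get a larger denominator.
        norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in query
            if t in tf
        )

    def rank(self, query: List[str], k: int = 10) -> List[int]:
        """Indices of the k highest-scoring documents, best first."""
        scores = [self.score(query, i) for i in range(len(self.doc_tfs))]
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

For example, `BM25(docs).rank(query_tokens, k=10)` returns the indices of the ten highest-scoring documents. Larger k1 lets repeated query terms keep adding score for longer before saturating; b controls how strongly long documents are penalized.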

Problem

Research questions and friction points this paper is trying to address.

Balancing retrieval accuracy and generative flexibility

Evaluating models for open-domain question answering

Comparing retrieval, generation, and hybrid model performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

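Hybrid models improve document reranking (BEIR nDCG@10 from 43.42 with BM25 to 52.59)

Dense retrievers excel in ODQA (DPR reaches 50.17% top-1 accuracy on NQ)

Retrieval-based methods such as BM25 yield lower language-modeling perplexity on WikiText-103 than generative and hybrid methods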
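
The perplexity comparison rests on the standard definition PPL = exp(-(1/N) Σ log p(x_i | x_<i)), so lower values mean the model assigns higher probability to the reference text. The minimal sketch below assumes per-token natural-log probabilities have already been obtained from some language model (with or without retrieved context); how the study conditions on retrieved text is not described here, and the `perplexity` helper is hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from the natural-log probabilities assigned to each gold token.

    PPL = exp(-(1/N) * sum_i log p(x_i | x_<i)); lower is better.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    # Mean negative log-likelihood per token, then exponentiate.
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

A model that spreads probability uniformly over four candidates at every step gets a perplexity of 4. Perplexity values are only directly comparable when computed over the same tokenization and evaluation text.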
