Face the Facts! Evaluating RAG-based Fact-checking Pipelines in Realistic Settings

📅 2024-12-19
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
This study systematically evaluates retrieval-augmented generation (RAG) for automated fact-checking in realistic settings, focusing on truth adjudication of stylistically complex claims and generation of concise verdicts, while assessing robustness across heterogeneous yet trustworthy knowledge bases. Method: We propose the first end-to-end framework integrating LLM-driven retrieval, multi-scale language models (zero-/few-shot and fine-tuned), and hybrid human-automated evaluation. Benchmarking is conducted across four dimensions: faithfulness, context adherence, sentiment alignment, and informativeness. Contribution/Results: Key findings reveal that LLM-based retrievers substantially outperform traditional methods but remain sensitive to knowledge heterogeneity; larger LMs improve adjudication faithfulness, whereas smaller LMs excel in context adherence; zero-/few-shot generators yield more informative verdicts, while fine-tuned models better capture sentiment nuance. The results expose complementary trade-offs between model scale and specialized capabilities, offering principled guidance for RAG deployment in fact-checking.

Technology Category

Application Category

📝 Abstract
Natural Language Processing and Generation systems have recently shown the potential to complement and streamline the costly and time-consuming job of professional fact-checkers. In this work, we lift several constraints of current state-of-the-art pipelines for automated fact-checking based on the Retrieval-Augmented Generation (RAG) paradigm. Our goal is to benchmark, under more realistic scenarios, RAG-based methods for the generation of verdicts - i.e., short texts discussing the veracity of a claim - evaluating them on stylistically complex claims and heterogeneous, yet reliable, knowledge bases. Our findings show a complex landscape, where, for example, LLM-based retrievers outperform other retrieval techniques, though they still struggle with heterogeneous knowledge bases; larger models excel in verdict faithfulness, while smaller models provide better context adherence, with human evaluations favouring zero-shot and one-shot approaches for informativeness, and fine-tuned models for emotional alignment.
Problem

Research questions and friction points this paper is trying to address.

Evaluating RAG-based fact-checking pipelines in realistic settings
Benchmarking verdict generation on complex claims and heterogeneous knowledge
Assessing performance trade-offs between different model configurations
Innovation

Methods, ideas, or system contributions that make the work stand out.

RAG-based automated fact-checking pipeline
Evaluating LLM retrievers on complex claims
Benchmarking verdict generation with heterogeneous knowledge bases
🔎 Similar Papers
No similar papers found.