🤖 AI Summary
Existing static benchmarks for RAG factuality saturate quickly and are costly to maintain, limiting their usefulness for continuously assessing system robustness. This work proposes IRB, a framework whose structured generation mechanism combines factual scaffolding with algorithmic scaffolding to automatically construct efficient, scalable factuality benchmarks. The resulting benchmark remains challenging for state-of-the-art large language models; the evaluation further suggests that reasoning-oriented LLMs are more factually reliable, and that improving the retrieval component is more cost-effective than scaling up the generator model.
📝 Abstract
Static benchmarks for RAG systems often suffer from rapid saturation and require significant manual effort to maintain robustness. To address this, we present IRB, a framework for automatically generating benchmarks that evaluate the factuality of RAG systems. IRB employs a structured generation pipeline built on a *factual scaffold* and an *algorithmic scaffold*. We use IRB to construct a benchmark and evaluate frontier LLMs and retrievers. Our results demonstrate that IRB poses a significant challenge for frontier LLMs in the closed-book setting. Furthermore, our evaluation suggests that reasoning LLMs are more reliable, and that improving the retrieval component may yield more cost-effective gains in RAG system correctness than scaling the generator.