🤖 AI Summary
This work addresses the pervasive issue of unannotated relevant passages in information retrieval (IR) benchmarks, which biases evaluation and distorts retriever rankings. To tackle this challenge, the authors propose DREAM, a novel framework built on a consensus-driven multi-agent adversarial debate mechanism. By combining iterative peer critique among large language models (LLMs) with consistency-guided human-in-the-loop adjudication, DREAM mitigates LLM overconfidence while efficiently generating high-quality relevance labels. With only 3.5% human intervention, the method achieves 95.2% annotation accuracy, recovers 29,824 previously missing relevant passages, and substantially corrects both ranking distortions in IR systems and retrieval-generation misalignment in retrieval-augmented generation (RAG). The resulting benchmark, BRIDGE, offers a fairer foundation for IR evaluation.
📝 Abstract
Information retrieval (IR) evaluation remains challenging due to incomplete IR benchmark datasets that contain unlabeled relevant chunks. While LLMs and LLM-human hybrid strategies reduce costly human effort, they remain prone to LLM overconfidence and ineffective AI-to-human escalation. To address this, we propose DREAM, a multi-round debate-based relevance assessment framework with LLM agents, built on opposing initial stances and iterative reciprocal critique. Through our agreement-based debate, it yields more accurate labeling for certain cases and more reliable AI-to-human escalation for uncertain ones, achieving 95.2% labeling accuracy with only 3.5% human involvement. Using DREAM, we build BRIDGE, a refined benchmark that mitigates evaluation bias and enables fairer retriever comparison by uncovering 29,824 missing relevant chunks. We then re-benchmark IR systems and extend evaluation to RAG, showing that unaddressed holes not only distort retriever rankings but also drive retrieval-generation misalignment. The relevance assessment framework is available at https://github.com/DISL-Lab/DREAM-ICLR-26, and the BRIDGE dataset is available at https://github.com/DISL-Lab/BRIDGE-Benchmark.
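The debate-then-escalate control flow described above can be sketched in a few lines. This is a hypothetical illustration only: the agent functions are stubs, and the function name, round budget, and label vocabulary are assumptions, not DREAM's actual implementation (see the linked repository for that).

```python
# Sketch of an agreement-based multi-round debate loop: two LLM agents start
# from opposing initial stances, exchange critiques for up to `max_rounds`,
# and any case still contested at the end is escalated to a human adjudicator.
# Agent internals are stubbed out; in DREAM these would be LLM calls.

def debate_relevance(agent_pro, agent_con, query, chunk, max_rounds=3):
    """Return (label, escalated): label is the consensus relevance label,
    or None with escalated=True when the agents never agree."""
    stance_pro, stance_con = "relevant", "not_relevant"  # opposing initial stances
    for _ in range(max_rounds):
        # Each agent sees the opponent's current stance and may revise its own.
        new_pro = agent_pro(query, chunk, opponent_stance=stance_con)
        new_con = agent_con(query, chunk, opponent_stance=stance_pro)
        stance_pro, stance_con = new_pro, new_con
        if stance_pro == stance_con:
            return stance_pro, False  # consensus reached: accept the label
    return None, True  # persistent disagreement: escalate to human


# Usage with trivial stub agents (placeholders for real LLM judgments):
agree = lambda q, c, opponent_stance: "relevant"
stubborn = lambda q, c, opponent_stance: "not_relevant"

print(debate_relevance(agree, agree, "q1", "chunk"))      # consensus case
print(debate_relevance(agree, stubborn, "q1", "chunk"))   # escalation case
```

The design point this sketch captures is that human effort is spent only on cases where the agents persistently disagree, which is how a low human-involvement rate can coexist with high labeling accuracy.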