🤖 AI Summary
This work addresses the pervasive issue of unannotated relevant passages in information retrieval (IR) benchmarks, which biases evaluation and distorts retriever rankings. To tackle this challenge, the authors propose DREAM, a novel framework built on a consensus-driven multi-agent adversarial debate mechanism. By combining iterative peer critique among large language models (LLMs) with consistency-guided human-in-the-loop adjudication, DREAM mitigates LLM overconfidence while efficiently generating high-quality relevance labels. With only 3.5% human intervention, the method achieves 95.2% annotation accuracy, recovers 29,824 previously missing relevant passages, and substantially corrects both ranking distortions in IR systems and retrieval-generation misalignment in retrieval-augmented generation (RAG). The resulting benchmark, BRIDGE, offers a fairer foundation for IR evaluation.
📝 Abstract
Information retrieval (IR) evaluation remains challenging due to incomplete IR benchmark datasets that contain unlabeled relevant chunks. While LLMs and LLM-human hybrid strategies reduce costly human effort, they remain prone to LLM overconfidence and ineffective AI-to-human escalation. To address this, we propose DREAM, a multi-round debate-based relevance assessment framework with LLM agents, built on opposing initial stances and iterative reciprocal critique. Through our agreement-based debate, it yields more accurate labeling for certain cases and more reliable AI-to-human escalation for uncertain ones, achieving 95.2% labeling accuracy with only 3.5% human involvement. Using DREAM, we build BRIDGE, a refined benchmark that mitigates evaluation bias and enables fairer retriever comparison by uncovering 29,824 missing relevant chunks. We then re-benchmark IR systems and extend evaluation to RAG, showing that unaddressed holes not only distort retriever rankings but also drive retrieval-generation misalignment. The relevance assessment framework is available at https://github.com/DISL-Lab/DREAM-ICLR-26, and the BRIDGE dataset is available at https://github.com/DISL-Lab/BRIDGE-Benchmark.
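The debate-then-escalate control flow described above can be sketched in a few lines. This is a hypothetical illustration only: the agent functions are stubs, and the function name, round budget, and label vocabulary are assumptions, not DREAM's actual implementation (see the linked repository for that).

```python
# Sketch of an agreement-based multi-round debate loop: two LLM agents start
# from opposing initial stances, exchange critiques for up to `max_rounds`,
# and any case still contested at the end is escalated to a human adjudicator.
# Agent internals are stubbed out; in DREAM these would be LLM calls.

def debate_relevance(agent_pro, agent_con, query, chunk, max_rounds=3):
    """Return (label, escalated): label is the consensus relevance label,
    or None with escalated=True when the agents never agree."""
    stance_pro, stance_con = "relevant", "not_relevant"  # opposing initial stances
    for _ in range(max_rounds):
        # Each agent sees the opponent's current stance and may revise its own.
        new_pro = agent_pro(query, chunk, opponent_stance=stance_con)
        new_con = agent_con(query, chunk, opponent_stance=stance_pro)
        stance_pro, stance_con = new_pro, new_con
        if stance_pro == stance_con:
            return stance_pro, False  # consensus reached: accept the label
    return None, True  # persistent disagreement: escalate to human


# Usage with trivial stub agents (placeholders for real LLM judgments):
agree = lambda q, c, opponent_stance: "relevant"
stubborn = lambda q, c, opponent_stance: "not_relevant"

print(debate_relevance(agree, agree, "q1", "chunk"))      # consensus case
print(debate_relevance(agree, stubborn, "q1", "chunk"))   # escalation case
```

The design point this sketch captures is that human effort is spent only on cases where the agents persistently disagree, which is how a low human-involvement rate can coexist with high labeling accuracy.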