🤖 AI Summary
In information retrieval, pooling-based test collections contain unjudged documents, leaving relevance judgments incomplete and compromising evaluation reliability. The common practice of treating unjudged documents as non-relevant introduces systematic bias. To address this, we propose a topic-specific relevance classifier built on the monoT5 model and fine-tuned efficiently via LoRA, requiring only 128 human annotations per topic, drawn exclusively from a single assessor's judgments on a single topic's pool. This design avoids the circular evaluation that arises when large language models (LLMs) act as "judges" and upholds human assessment as the gold standard. Experiments demonstrate that the method achieves a Spearman correlation exceeding 0.95 between the resulting system rankings and ground-truth rankings, proving more reliable than LLM-as-a-judge approaches and substantially improving the accuracy and credibility of cross-system comparisons.
📝 Abstract
The unjudged document problem, where pooled test collections have incomplete relevance judgments for evaluating new retrieval systems, is a key obstacle to the reusability of test collections in information retrieval. While the de facto standard for dealing with the problem is to treat unjudged documents as non-relevant, many alternatives have been proposed, including the use of large language models (LLMs) as relevance judges (LLM-as-a-judge). However, this has been criticized as circular, since the same LLM can serve as both judge and ranker. We propose to train topic-specific relevance classifiers instead: by fine-tuning monoT5 with independent LoRA weight adaptation on the judgments of a single assessor for a single topic's pool, we align it to that assessor's notion of relevance for the topic. The system rankings obtained through our classifier's relevance judgments achieve a Spearman's $\rho$ correlation of $>0.95$ with ground-truth system rankings. As few as 128 initial human judgments per topic suffice to improve the comparability of models, compared to treating unjudged documents as non-relevant, while achieving more reliability than existing LLM-as-a-judge approaches. Topic-specific relevance classifiers are thus a lightweight and straightforward way to tackle the unjudged document problem while maintaining human judgments as the gold standard for retrieval evaluation. Code, models, and data are made openly available.
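To make the recipe concrete, below is a minimal sketch of per-topic LoRA fine-tuning for monoT5 using the Hugging Face `transformers` and `peft` libraries. This is not the authors' released code: the checkpoint name, LoRA hyperparameters, learning rate, and the toy judgments are illustrative assumptions. It follows monoT5's standard prompt format ("Query: ... Document: ... Relevant:") with "true"/"false" target tokens, which is how monoT5 is conventionally trained and queried.

```python
# Sketch: fine-tune one independent LoRA adapter per topic on a handful of
# human relevance judgments, then use it to label unjudged pooled documents.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
from peft import LoraConfig, get_peft_model

model_name = "castorini/monot5-base-msmarco"  # a standard monoT5 checkpoint
tokenizer = T5Tokenizer.from_pretrained(model_name)
base = T5ForConditionalGeneration.from_pretrained(model_name)

# One LoRA adapter per topic; rank/alpha/target modules are illustrative.
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q", "v"],
                  task_type="SEQ_2_SEQ_LM")
model = get_peft_model(base, lora)

# Hypothetical judgments for a single topic: in the paper's setting this
# would be up to 128 (document, relevant?) pairs from a single assessor.
judgments = [
    ("effects of lora fine-tuning", "LoRA adapts low-rank matrices ...", True),
    ("effects of lora fine-tuning", "A recipe for sourdough bread ...", False),
]

def encode(query, doc, relevant):
    """Build the monoT5 prompt and the 'true'/'false' target."""
    prompt = f"Query: {query} Document: {doc} Relevant:"
    inputs = tokenizer(prompt, truncation=True, max_length=512,
                       return_tensors="pt")
    labels = tokenizer("true" if relevant else "false",
                       return_tensors="pt").input_ids
    return inputs, labels

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
for epoch in range(3):                      # epoch count is an assumption
    for query, doc, relevant in judgments:
        inputs, labels = encode(query, doc, relevant)
        loss = model(**inputs, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

@torch.no_grad()
def predict_relevant(query, doc):
    """Classify an unjudged document with the topic-adapted model by
    comparing first-step decoder logits for 'true' vs. 'false', as in
    standard monoT5 inference."""
    model.eval()
    inputs, _ = encode(query, doc, True)
    start = torch.full((1, 1), model.config.decoder_start_token_id,
                       dtype=torch.long)
    logits = model(**inputs, decoder_input_ids=start).logits[0, 0]
    true_id = tokenizer.convert_tokens_to_ids("\u2581true")
    false_id = tokenizer.convert_tokens_to_ids("\u2581false")
    return bool(logits[true_id] > logits[false_id])
```

At evaluation time, `predict_relevant` would fill in labels for each topic's unjudged pooled documents, and the resulting system rankings could be compared against ground truth with, e.g., `scipy.stats.spearmanr`. Keeping the adapters independent per topic is what aligns each classifier to a single assessor's notion of relevance rather than to a generic one.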