Topic-Specific Classifiers are Better Relevance Judges than Prompted LLMs

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
In information retrieval, pooling-based test collections contain unjudged documents, so relevance judgments are incomplete and evaluation reliability suffers. The common workaround of treating unjudged documents as non-relevant introduces systematic bias. To address this, we propose topic-specific relevance classifiers: monoT5 fine-tuned efficiently via LoRA, requiring only 128 human annotations per topic, drawn exclusively from a single assessor's judgments on that topic. This design avoids the circularity of large language models (LLMs) acting as judges and keeps human assessment as the gold standard. Experiments show that system rankings derived from the classifier's judgments achieve a Spearman correlation above 0.95 with ground-truth rankings, outperforming LLM-as-a-judge approaches and improving the reliability of cross-system comparisons.
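The headline metric is rank correlation between system orderings under two judgment sets. A minimal pure-Python sketch of Spearman's ρ (with midrank tie handling); the system scores below are toy numbers for illustration, not results from the paper:

```python
def rank(values):
    # Assign 1-based ranks in ascending order; ties get the midrank.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # midrank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    # Spearman's rho = Pearson correlation of the rank vectors.
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy effectiveness scores for five systems under two judgment sets:
# ground-truth human judgments vs. classifier-completed judgments.
ground_truth = [0.61, 0.55, 0.48, 0.47, 0.30]
classifier   = [0.60, 0.50, 0.52, 0.45, 0.33]
print(round(spearman_rho(ground_truth, classifier), 3))  # one swapped pair -> 0.9
```

A ρ near 1 means the classifier-completed judgments rank systems almost exactly as the full human judgments would, which is the reusability property the paper targets.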

📝 Abstract
The unjudged document problem, where pooled test collections have incomplete relevance judgments for evaluating new retrieval systems, is a key obstacle to the reusability of test collections in information retrieval. While the de facto standard to deal with the problem is to treat unjudged documents as non-relevant, many alternatives have been proposed, including the use of large language models (LLMs) as a relevance judge (LLM-as-a-judge). However, this has been criticized as circular, since the same LLM can be used as a judge and as a ranker at the same time. We propose to train topic-specific relevance classifiers instead: By finetuning monoT5 with independent LoRA weight adaptation on the judgments of a single assessor for a single topic's pool, we align it to that assessor's notion of relevance for the topic. The system rankings obtained through our classifier's relevance judgments achieve a Spearman's $ρ$ correlation of $>0.95$ with ground truth system rankings. As few as 128 initial human judgments per topic suffice to improve the comparability of models, compared to treating unjudged documents as non-relevant, while achieving more reliability than existing LLM-as-a-judge approaches. Topic-specific relevance classifiers thus are a lightweight and straightforward way to tackle the unjudged document problem, while maintaining human judgments as the gold standard for retrieval evaluation. Code, models, and data are made openly available.
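monoT5 casts relevance judging as sequence-to-sequence classification: the model is prompted with a fixed template and trained to generate "true" or "false", with the probability of "true" used as the relevance score. A small sketch of that input template (the template follows the original monoT5 setup; the query and document strings are made up for illustration):

```python
def monot5_input(query: str, document: str) -> str:
    # monoT5's relevance prompt: the model generates "true"/"false"
    # after this template, and P("true") serves as the relevance score.
    return f"Query: {query} Document: {document} Relevant:"

prompt = monot5_input(
    "unjudged document problem",
    "Pooled test collections contain documents never seen by assessors.",
)
print(prompt)
```

The paper's contribution sits on top of this classifier: per-topic LoRA adapters are fine-tuned on one assessor's pool judgments, so each adapter encodes that assessor's notion of relevance for that topic.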
Problem

Research questions and friction points this paper is trying to address.

Addressing incomplete relevance judgments in pooled test collections
Training topic-specific classifiers to replace LLM-as-judge approaches
Improving retrieval evaluation reliability with minimal human judgments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Finetuning monoT5 with LoRA adaptation
Training topic-specific relevance classifiers
Using 128 human judgments per topic
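LoRA keeps the pretrained weight $W$ frozen and learns only a low-rank update $\Delta W = (\alpha / r)\,BA$, which is why one small adapter per topic is cheap to train and store. A tiny pure-Python numerical sketch of the forward pass (toy dimensions and values, not the paper's actual T5 configuration):

```python
def matvec(M, x):
    # Plain matrix-vector product over nested lists.
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=16, r=2):
    # y = W x + (alpha / r) * B (A x): frozen base weight W plus a
    # rank-r update B @ A, scaled by alpha / r as in LoRA.
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # low-rank path: (d_out x r) @ (r x d_in)
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy frozen weight W (2x3) with a rank-2 adapter: A is 2x3, B is 2x2.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[0.1, 0.0, 0.0],
     [0.0, 0.1, 0.0]]
B = [[0.1, 0.0],
     [0.0, 0.1]]
x = [1.0, 2.0, 3.0]
print(lora_forward(W, A, B, x))
```

With $B$ initialized to zero the adapted layer starts out identical to the frozen model; training only $A$ and $B$ on the 128 per-topic judgments then nudges the classifier toward that topic's assessor.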