🤖 AI Summary
This study addresses the issue of resource redundancy in current large language models (LLMs) when applied to relevance judgment tasks. It presents the first systematic investigation into the feasibility of directly employing reranking models as relevance assessors, introducing two adaptation strategies: binarized output and score thresholding. The evaluation spans three model families and eight model scales (ranging from 220M to 32B parameters), demonstrating that reranking models outperform the state-of-the-art LLM-based assessor UMBRELA in approximately 40%–50% of scenarios on the TREC-DL 2019–2023 benchmarks. The analysis further uncovers significant self-preference and cross-family evaluation biases in LLM-based assessors. This work establishes a new paradigm for efficiently repurposing existing reranking models as lightweight, high-performing alternatives to LLMs in relevance assessment.
📝 Abstract
Using large language models (LLMs) to predict relevance judgments has shown promising results. Most studies treat this task as a distinct research line, e.g., focusing on prompt design for predicting relevance labels given a query and passage. However, predicting relevance judgments is essentially a form of relevance prediction, a problem extensively studied in tasks such as re-ranking. Despite this potential overlap, little research has explored reusing or adapting established re-ranking methods to predict relevance judgments, leading to potential resource waste and redundant development. To bridge this gap, we reproduce re-rankers in a re-ranker-as-relevance-judge setup. We design two adaptation strategies: (i) using binary tokens (e.g., "true" and "false") generated by a re-ranker as direct judgments, and (ii) converting continuous re-ranking scores into binary labels via thresholding. We perform extensive experiments on TREC-DL 2019 to 2023 with 8 re-rankers from 3 families, ranging from 220M to 32B parameters, and analyse the evaluation bias exhibited by re-ranker-based judges. Results show that re-ranker-based relevance judges, under both strategies, can outperform UMBRELA, a state-of-the-art LLM-based relevance judge, in around 40% to 50% of the cases; they also exhibit strong self-preference towards their own and same-family re-rankers, as well as cross-family bias.
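The two adaptation strategies described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the 0.5 threshold, and the example scores are all assumptions for demonstration.

```python
def judge_by_token(token: str) -> int:
    """Strategy (i): take a binary token generated by a re-ranker
    (e.g. "true"/"false") as a direct relevance judgment."""
    return 1 if token.strip().lower() == "true" else 0

def judge_by_threshold(score: float, threshold: float = 0.5) -> int:
    """Strategy (ii): binarize a continuous re-ranking score by
    thresholding (threshold value is an illustrative assumption)."""
    return 1 if score >= threshold else 0

# Hypothetical (query, passage, re-ranker score) triples for illustration.
scored = [("q1", "p1", 0.91), ("q1", "p2", 0.12)]
labels = [judge_by_threshold(score) for _, _, score in scored]
print(labels)  # [1, 0]
```

In practice the threshold in strategy (ii) would be tuned per re-ranker, since different families produce scores on different scales.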