🤖 AI Summary
This work investigates how document summarization affects the reliability of large language models (LLMs) as automated relevance assessors and the downstream impact on information retrieval (IR) evaluation. We systematically compare LLM-generated relevance judgments produced from summaries of multiple lengths against those derived from full documents, across several TREC benchmarks. Our method uses state-of-the-art LLMs to generate abstractive summaries of varying lengths and evaluates their effects on label distributions, ranking stability, and inter-system correlation. Results show that while summaries preserve the stability of retrieval-system rankings, they induce model- and dataset-dependent shifts in relevance label distributions; both summary length and choice of LLM significantly modulate annotation bias. This study is the first to characterize summarization as a trade-off for IR evaluation: it improves computational efficiency but risks compromising annotation reliability. We provide methodological cautions and practical guidelines for deploying LLM-based automatic evaluation in IR, highlighting the need for careful bias mitigation and summary-aware calibration strategies.
📝 Abstract
Relevance judgments are central to the evaluation of Information Retrieval (IR) systems, but obtaining them from human annotators is costly and time-consuming. Large Language Models (LLMs) have recently been proposed as automated assessors, showing promising alignment with human annotations. Most prior studies have treated documents as fixed units, feeding their full content directly to LLM assessors. We investigate how text summarization affects the reliability of LLM-based judgments and their downstream impact on IR evaluation. Using state-of-the-art LLMs across multiple TREC collections, we compare judgments made from full documents with those based on LLM-generated summaries of different lengths. We examine their agreement with human labels, their effect on retrieval effectiveness evaluation, and their influence on the ranking stability of IR systems. Our findings show that summary-based judgments achieve system-ranking stability comparable to that of full-document judgments, while introducing systematic shifts in label distributions and biases that vary by model and dataset. These results highlight summarization as both an opportunity for more efficient large-scale IR evaluation and a methodological choice with important implications for the reliability of automatic judgments.
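The ranking-stability comparison described above is typically quantified with a rank correlation such as Kendall's tau between the system orderings induced by two sets of judgments. A minimal sketch, using made-up effectiveness scores (system names and values are illustrative, not from the paper):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall rank correlation between two equal-length score lists
    (simple tie-free variant: concordant minus discordant pairs,
    normalized by the number of informative pairs)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical MAP scores for four IR systems under two judgment sets:
full_doc = [0.42, 0.37, 0.31, 0.25]  # judged from full documents
summary  = [0.39, 0.36, 0.28, 0.24]  # judged from summaries: scores shift
                                     # down, but the ordering is preserved

tau = kendall_tau(full_doc, summary)  # 1.0: identical system ranking
```

This mirrors the paper's finding: summary-based judgments can shift absolute label distributions and scores while leaving the relative ranking of systems (and hence tau) largely intact.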