Fine Grained Evaluation of LLMs-as-Judges

📅 2026-01-13
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work proposes a fine-grained evaluation framework to assess the capability of large language models (LLMs) as relevance judges in information retrieval, extending beyond holistic document-level judgments to identify the specific textual spans that support those judgments. Leveraging a Wikipedia test collection derived from INEX, the study employs prompt engineering to guide LLMs in simultaneously performing document-level relevance assessment and span-level annotation, followed by comparative analysis against human annotations. By introducing fine-grained relevance evaluation into the LLMs-as-Judges paradigm, this research is the first to examine whether models are "right for the right reasons," thereby substantially enhancing the credibility of automated evaluation. Experimental results demonstrate that, under human supervision, LLMs can accurately identify both relevant documents and the key evidence spans within them.
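The summary describes a single prompt that asks the model for both a document-level judgment and the supporting spans. The exact prompt is not given, so the template below is only an illustrative sketch; the wording, the JSON schema, and the `build_judge_prompt` helper are assumptions, not the authors' prompt.

```python
# Hypothetical sketch of a combined document-level / span-level relevance prompt.
# The instructions and output schema are assumed; the paper's prompt may differ.

PROMPT_TEMPLATE = """You are a relevance assessor for an ad hoc retrieval task.
Query: {query}

Document:
{document}

1. Decide whether the document is relevant to the query.
2. If it is relevant, quote verbatim every passage in the document that
   responds to the information need expressed in the query.

Answer as JSON: {{"relevant": true or false, "passages": ["..."]}}"""


def build_judge_prompt(query: str, document: str) -> str:
    """Fill the template for a single (query, document) pair."""
    return PROMPT_TEMPLATE.format(query=query, document=document)
```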

📝 Abstract
A good deal of recent research has focused on how Large Language Models (LLMs) may be used as `judges' in place of humans to evaluate the quality of the output produced by various text/image processing systems. Within this broader context, a number of studies have investigated the specific question of how effectively LLMs can be used as relevance assessors for the standard ad hoc task in Information Retrieval (IR). We extend these studies by looking at additional questions. Most importantly, we use a Wikipedia-based test collection created by the INEX initiative, and prompt LLMs to not only judge whether documents are relevant/non-relevant, but to highlight relevant passages in documents that they regard as useful. The human relevance assessors involved in creating this collection were given analogous instructions, i.e., they were asked to highlight all passages within a document that respond to the information need expressed in a query. This enables us to evaluate the quality of LLMs as judges not only at the document level, but also to quantify how often these `judges' are right for the right reasons. Our findings suggest that LLMs-as-judges work best under human supervision.
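The abstract does not state how the LLM-highlighted passages are scored against the human highlights. Assuming both are reduced to character offsets within the same document, one plausible agreement measure is character-level precision and recall over the highlighted spans; the sketch below illustrates that idea and is not necessarily the paper's metric.

```python
def span_overlap(pred_spans, gold_spans, doc_len):
    """Character-level precision/recall between predicted and gold highlight spans.

    Spans are (start, end) character offsets into the same document.
    Illustrative metric only; the paper's actual scoring may differ.
    """
    pred, gold = set(), set()
    for s, e in pred_spans:
        pred.update(range(max(0, s), min(doc_len, e)))
    for s, e in gold_spans:
        gold.update(range(max(0, s), min(doc_len, e)))

    overlap = len(pred & gold)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


# Example: LLM highlights chars 10-60, human marked 30-80 in a 200-char document.
print(span_overlap([(10, 60)], [(30, 80)], 200))  # (0.6, 0.6)
```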
Problem

Research questions and friction points this paper is trying to address.

LLMs-as-Judges
Fine-grained Evaluation
Relevance Assessment
Information Retrieval
Passage-level Judgement
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained evaluation
LLMs-as-judges
relevance assessment
passage-level highlighting
information retrieval