Revisiting Human-vs-LLM judgments using the TREC Podcast Track

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates agreement between large language models (LLMs) and human assessors on relevance judgments in a non-traditional retrieval setting, namely two-minute podcast transcript segments, and examines how such agreement affects system rankings. Using data from the TREC 2020–2021 Podcast Track, five distinct LLMs relabel all query-segment pairs, and experts re-evaluate the cases with the greatest divergence between LLM and original human assessments. The work extends LLM–human consistency research to the podcast transcription domain for the first time and finds that the experts agree with the LLM judgments more often than with the original assessors, highlighting the limitations of single-assessor labeling. The markedly higher agreement between LLMs and expert reviewers supports multi-assessor protocols or LLM-augmented evaluation to improve annotation reliability.
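As a rough illustration of the agreement analysis described above, the sketch below computes Cohen's kappa between LLM, original TREC, and expert relevance labels, and counts which side the expert takes on the pairs where the LLM and the TREC assessor disagree. The tiny inline dataset, the 0–3 grading scale, and the use of scikit-learn are illustrative assumptions, not the paper's actual pipeline.

```python
# Hedged sketch: compare LLM vs. original-assessor relevance labels and check
# which side an expert re-assessment agrees with. All labels below are made up
# for illustration.
from sklearn.metrics import cohen_kappa_score

# Graded relevance labels per (query, segment) pair: 0 = not relevant ... 3 = highly relevant
trec_labels   = [2, 0, 1, 3, 0, 2, 1, 0]   # original TREC assessor
llm_labels    = [2, 1, 1, 3, 0, 0, 2, 0]   # one LLM re-assessment
expert_labels = [2, 1, 1, 3, 0, 0, 1, 0]   # expert re-judgment of the same pairs

print("kappa(LLM, TREC):   ", cohen_kappa_score(llm_labels, trec_labels))
print("kappa(expert, TREC):", cohen_kappa_score(expert_labels, trec_labels))
print("kappa(expert, LLM): ", cohen_kappa_score(expert_labels, llm_labels))

# On disagreement cases only, count which label the expert sides with.
expert_with_llm = expert_with_trec = 0
for t, l, e in zip(trec_labels, llm_labels, expert_labels):
    if t != l:                        # pair where LLM and TREC assessor disagree
        expert_with_llm  += (e == l)
        expert_with_trec += (e == t)
print(f"expert sides with LLM on {expert_with_llm} and with TREC on {expert_with_trec} disputed pairs")
```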

📝 Abstract
Using large language models (LLMs) to annotate relevance is an increasingly important technique in the information retrieval community. While some studies demonstrate that LLMs can achieve high agreement with ground-truth (human) judgments, other studies have argued for the opposite conclusion. To the best of our knowledge, these studies have primarily focused on classic ad-hoc text search scenarios. In this paper, we analyze agreement between LLMs and human experts and explore the impact that disagreement has on system rankings. In contrast to prior studies, we focus on a collection composed of audio files transcribed into two-minute segments: the TREC 2020 and 2021 Podcast Track. We employ five different LLMs to re-assess all of the query-segment pairs originally annotated by TREC assessors. Furthermore, we re-assess a small subset of pairs where the LLMs and TREC assessors disagree the most, and find that the human experts tend to agree with the LLMs more than with the TREC assessors. Our results reinforce Sormunen's 2002 insight that relying on a single assessor leads to lower agreement.
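As a rough sketch of how an LLM can be asked to re-assess a query-segment pair, the snippet below builds a grading prompt and parses a 0–3 relevance label from the reply. The prompt wording, the grading scale, and call_llm() (a placeholder for whichever chat-completion API is used) are assumptions for illustration, not the authors' actual setup.

```python
# Hedged sketch of LLM-based relevance labeling for a query / transcript-segment pair.
import re

PROMPT = """You are a relevance assessor for a podcast search task.
Query: {query}
Two-minute transcript segment: {segment}
On a scale of 0 (not relevant) to 3 (perfectly relevant), how relevant is the
segment to the query? Answer with a single digit."""

def call_llm(prompt: str) -> str:
    # Placeholder: substitute a real chat-completion call for the chosen model here.
    return "2"

def judge(query: str, segment: str) -> int:
    """Ask the LLM for a graded relevance label and parse the first digit it returns."""
    reply = call_llm(PROMPT.format(query=query, segment=segment))
    match = re.search(r"[0-3]", reply)
    return int(match.group()) if match else 0

print(judge("coffee roasting at home",
            "In this episode we walk through roasting green beans in a popcorn popper..."))
```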
Problem

Research questions and friction points this paper is trying to address.

LLM-human disagreement
relevance judgment
information retrieval
TREC Podcast Track
assessor agreement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Relevance Judgment
Information Retrieval
TREC Podcast Track
Human-LLM Agreement