🤖 AI Summary
Existing query performance prediction (QPP) methods output only a single scalar score, which limits their ability to approximate diverse IR evaluation metrics and offers little interpretability. This paper proposes QPP-GenRE, a framework that reformulates QPP as a document-level relevance prediction task: open-source large language models (e.g., Llama, Falcon) generate relevance judgments for each item in a ranked list, and any IR evaluation measure can then be computed from these pseudo-labels. This design makes predictions metric-agnostic and interpretable, and allows errors in the generated judgments to be identified, tracked, and rectified. To keep prediction tractable, QPP-GenRE approximates recall-oriented measures without judging the entire corpus, and it fine-tunes the LLMs on human-labeled relevance judgments to overcome the limited zero-/few-shot performance of prompting alone. Extensive experiments on the TREC 2019–2022 deep learning tracks and CAsT-19–20 demonstrate that QPP-GenRE achieves state-of-the-art QPP quality for both lexical and neural rankers.
📝 Abstract
Query performance prediction (QPP) aims to estimate the retrieval quality of a search system for a query without human relevance judgments. Previous QPP methods typically return a single scalar value and do not require the predicted values to approximate a specific information retrieval (IR) evaluation measure, leading to certain drawbacks: (i) a single scalar is insufficient to accurately represent different IR evaluation measures, especially when metrics do not highly correlate, and (ii) a single scalar limits the interpretability of QPP methods because solely using a scalar is insufficient to explain QPP results. To address these issues, we propose a QPP framework using automatically generated relevance judgments (QPP-GenRE), which decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query. This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels. This also allows us to interpret predicted IR evaluation measures, and identify, track and rectify errors in generated relevance judgments to improve QPP quality. We predict an item's relevance by using open-source large language models (LLMs) to ensure scientific reproducibility. We face two main challenges: (i) excessive computational costs of judging an entire corpus for predicting a metric considering recall, and (ii) limited performance in prompting open-source LLMs in a zero-/few-shot manner. To solve the challenges, we devise an approximation strategy to predict an IR measure considering recall and propose to fine-tune open-source LLMs using human-labeled relevance judgments. Experiments on the TREC 2019–2022 deep learning tracks and CAsT-19–20 datasets show that QPP-GenRE achieves state-of-the-art QPP quality for both lexical and neural rankers.
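To make the decomposition concrete, here is a minimal Python sketch of the QPP-GenRE idea, not the authors' implementation: the `judge` callable is a hypothetical stand-in for the fine-tuned LLM's per-document relevance prediction, and the metric computations (RR@k plus a recall measure approximated from judging only the top-n ranked items instead of the whole corpus) are standard formulas assumed for illustration.

```python
from typing import Callable, List


def predict_metrics(
    query: str,
    ranked_docs: List[str],             # top-n items returned by the ranker
    judge: Callable[[str, str], int],   # hypothetical LLM judge: (query, doc) -> 0/1
    k: int = 10,
) -> dict:
    """Generate per-document relevance judgments and derive IR measures from them."""
    # Decompose QPP into independent per-document relevance predictions.
    labels = [judge(query, doc) for doc in ranked_docs]

    # Precision-oriented measure (RR@k) computed from the pseudo-labels.
    rr = 0.0
    for rank, rel in enumerate(labels[:k], start=1):
        if rel:
            rr = 1.0 / rank
            break

    # Recall-oriented measure, approximated by treating the relevant items
    # found in the judged top-n list as the recall base, so the entire
    # corpus never needs to be judged.
    approx_total_rel = sum(labels) or 1  # avoid division by zero
    recall_at_k = sum(labels[:k]) / approx_total_rel

    # Returning the labels alongside the scores is what makes the prediction
    # interpretable: an error in a predicted measure can be traced back to,
    # and rectified in, a specific generated judgment.
    return {"labels": labels, f"RR@{k}": rr, f"approx_recall@{k}": recall_at_k}
```

Because the predicted score is assembled from explicit per-document judgments rather than emitted as an opaque scalar, swapping in a different metric only changes the final aggregation step, not the judging stage.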