Batched Self-Consistency Improves LLM Relevance Assessment and Ranking

📅 2025-05-18
🤖 AI Summary
Large language models (LLMs) show limited reliability when assessing and ranking candidate text relevance one passage at a time. To address this, the paper integrates self-consistency with batched pointwise (PW) evaluation, incorporating subset reselection and permutation perturbation into batched prompt design to increase prompt diversity and reasoning robustness across self-consistency calls. The method evaluates multiple candidates in parallel within a single LLM invocation, then aggregates the self-consistent outputs by majority vote. Evaluated with GPT-4o, Claude Sonnet 3, and Amazon Nova Pro on three passage retrieval datasets, including a legal search dataset, the approach improves NDCG@10 by up to 7.5 percentage points over baselines and significantly outperforms both one-by-one PW and listwise ranking methods in effectiveness and efficiency.

📝 Abstract
Given some information need, Large Language Models (LLMs) are increasingly used for candidate text relevance assessment, typically using a one-by-one pointwise (PW) strategy where each LLM call evaluates one candidate at a time. Meanwhile, it has been shown that LLM performance can be improved through self-consistency: prompting the LLM to do the same task multiple times (possibly in perturbed ways) and then aggregating the responses. To take advantage of self-consistency, we hypothesize that batched PW strategies, where multiple passages are judged in one LLM call, are better suited than one-by-one PW methods since a larger input context can induce more diverse LLM sampling across self-consistency calls. We first propose several candidate batching strategies to create prompt diversity across self-consistency calls through subset reselection and permutation. We then test our batched PW methods on relevance assessment and ranking tasks against one-by-one PW and listwise LLM ranking baselines with and without self-consistency, using three passage retrieval datasets and GPT-4o, Claude Sonnet 3, and Amazon Nova Pro. We find that batched PW methods outperform all baselines, and show that batching can greatly amplify the positive effects of self-consistency. For instance, on our legal search dataset, GPT-4o one-by-one PW ranking NDCG@10 improves only from 44.9% to 46.8% when going from no self-consistency to 15 self-consistency calls, while batched PW ranking improves from 43.8% to 51.3%.
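The core loop described in the abstract — permute (or resample) the candidate batch across several self-consistency calls, judge all passages in one LLM invocation per call, then majority-vote the per-passage labels — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the `judge_fn` callable standing in for the batched LLM call is a hypothetical interface.

```python
import random
from collections import Counter

def batched_pw_relevance(passages, query, judge_fn, n_calls=15, seed=0):
    """Aggregate batched pointwise relevance judgments via self-consistency.

    passages : list of (passage_id, text) pairs forming the candidate batch.
    judge_fn : hypothetical stand-in for one batched LLM call; takes (query,
               batch) and returns {passage_id: relevance_label} for every
               passage in the batch.
    n_calls  : number of self-consistency calls (the paper reports up to 15).
    """
    rng = random.Random(seed)
    votes = {pid: Counter() for pid, _ in passages}
    for _ in range(n_calls):
        # Permutation perturbation: shuffle candidate order so each
        # self-consistency call presents a differently ordered prompt.
        batch = passages[:]
        rng.shuffle(batch)
        labels = judge_fn(query, batch)
        for pid, label in labels.items():
            votes[pid][label] += 1
    # Majority vote per passage across all self-consistency calls.
    return {pid: counts.most_common(1)[0][0] for pid, counts in votes.items()}
```

Subset reselection (judging a different random subset of the candidates in each call, rather than the full permuted batch) would replace the shuffle with `rng.sample(passages, k)` for some batch size `k`, with votes accumulated only for the passages actually sampled.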
Problem

Research questions and friction points this paper is trying to address.

Improving LLM relevance assessment via batched self-consistency
Enhancing ranking accuracy with diverse prompt batching strategies
Comparing batched vs one-by-one LLM evaluation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Batched PW strategies improve LLM relevance assessment
Subset reselection and permutation enhance prompt diversity
Batching amplifies self-consistency benefits in ranking tasks