An Investigation of Prompt Variations for Zero-shot LLM-based Rankers

📅 2024-06-20
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates the extent to which prompt design influences zero-shot large language model (LLM) ranking effectiveness, and whether its impact rivals that of the ranking algorithm and the backbone model (e.g., GPT-3.5, FLAN-T5). Through large-scale, controlled experiments, the authors systematically disentangle the effects of prompt components (e.g., role-definition phrasing), ranking paradigms (e.g., pointwise vs. listwise), and backbone LLMs. They find that while ranking algorithms and backbones both contribute to performance differences, the choice of prompt components and wordings can, at times, matter even more, and that differences among ranking methods become blurred once prompt variations are considered. The work challenges attributing ranking performance primarily to algorithm or model choice and argues for more careful, attribution-aware evaluation in LLM-based ranking research.

📝 Abstract
We provide a systematic understanding of the impact of specific components and wordings used in prompts on the effectiveness of rankers based on zero-shot Large Language Models (LLMs). Several zero-shot ranking methods based on LLMs have recently been proposed. Among many aspects, methods differ across (1) the ranking algorithm they implement, e.g., pointwise vs. listwise, (2) the backbone LLMs used, e.g., GPT-3.5 vs. FLAN-T5, (3) the components and wording used in prompts, e.g., the use or not of role-definition (role-playing) and the actual words used to express this. It is currently unclear whether performance differences are due to the underlying ranking algorithm, or because of spurious factors such as better choice of words used in prompts. This confusion risks undermining future research. Through our large-scale experimentation and analysis, we find that ranking algorithms do contribute to differences between methods for zero-shot LLM ranking. However, so do the LLM backbones -- but even more importantly, the choice of prompt components and wordings affect the ranking. In fact, in our experiments, we find that, at times, these latter elements have more impact on the ranker's effectiveness than the actual ranking algorithms, and that differences among ranking methods become more blurred when prompt variations are considered.
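The pointwise/listwise distinction and the role-definition component described above can be made concrete with a small sketch. This is not the paper's actual prompt text; the role wording, scale, and function names below are illustrative assumptions showing how the same query and candidates feed the two paradigms, with role-playing as a toggleable prompt component.

```python
# Illustrative sketch (not the paper's exact prompts): the same query and
# candidate passages feed two ranking paradigms, with role definition
# (role-playing) as an optional prompt component. All wordings here are
# assumptions for illustration.

ROLE = "You are an expert passage-ranking assistant."  # role-definition component


def pointwise_prompt(query: str, doc: str, use_role: bool = True) -> str:
    """One prompt per (query, passage) pair; the LLM scores relevance."""
    role = ROLE + "\n" if use_role else ""
    return (f"{role}Judge the relevance of the passage to the query "
            f"on a scale of 0-3.\nQuery: {query}\nPassage: {doc}\nRelevance:")


def listwise_prompt(query: str, docs: list[str], use_role: bool = True) -> str:
    """One prompt over the whole candidate list; the LLM emits an ordering."""
    role = ROLE + "\n" if use_role else ""
    numbered = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    return (f"{role}Rank the passages below by relevance to the query, "
            f"most relevant first, as a list of identifiers.\n"
            f"Query: {query}\n{numbered}\nRanking:")


if __name__ == "__main__":
    q = "effects of caffeine on sleep"
    docs = ["Caffeine delays sleep onset.", "Coffee history in Europe."]
    print(pointwise_prompt(q, docs[0]))
    print(listwise_prompt(q, docs, use_role=False))
```

The paper's point is that perturbing pieces like `ROLE` or the instruction wording, while holding the paradigm fixed, can shift effectiveness by more than swapping pointwise for listwise.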
Problem

Research questions and friction points this paper is trying to address.

LLM-based Rankers
Prompt Influence
Zero-shot Ranking
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM Ranking Methods
Prompt Engineering
Prompt Component Ablation
Shuoqi Sun
PhD student at RMIT University
Information Retrieval · Search Systems
Shengyao Zhuang
Amazon, AGI
Information Retrieval · NLP
Shuai Wang
The University of Queensland, St. Lucia, Australia
G. Zuccon
The University of Queensland, St. Lucia, Australia