🤖 AI Summary
Large language models (LLMs) exhibit positional bias—where candidate ordering influences ranking and evaluation outcomes—and low repetition consistency—yielding unstable predictions for identical inputs—both of which undermine reliability. To address these issues, we propose a dynamic repetition strategy featuring the first confidence-driven early-stopping mechanism: for each input instance, it adaptively estimates the minimum number of repetitions required, then combines majority voting with explicit positional-bias modeling for fine-grained correction. Unlike static repetition schemes, our method eliminates the need for a pre-specified repetition count. We validate it across three LLM scales and two distinct task categories. Results show that our approach reduces average model calls by 81%–87% compared to static repetition while preserving high ranking accuracy, yielding significant improvements in both computational efficiency and robustness without sacrificing performance.
📝 Abstract
When using LLMs to rank items based on given criteria, or to evaluate answers, the order of candidate items can influence the model's final decision. This sensitivity to item positioning in an LLM's prompt is known as position bias. Prior research shows that this bias exists even in large models, though its severity varies across models and tasks. In addition to position bias, LLMs also exhibit varying degrees of low repetition consistency, where repeating the LLM call with the same candidate ordering can lead to different rankings. To address both inconsistencies, a common approach is to prompt the model multiple times with different candidate orderings and aggregate the results via majority voting. However, this repetition strategy significantly increases computational costs. Extending prior findings, we observe that both the direction -- favoring either the earlier or later candidate in the prompt -- and the magnitude of position bias vary substantially across instances, even within a single dataset. This observation highlights the need for a per-instance mitigation strategy. To this end, we introduce a dynamic early-stopping method that adaptively determines the number of repetitions required for each instance. Evaluating our approach across three LLMs of varying sizes and on two tasks, namely re-ranking and alignment, we demonstrate that transitioning to a dynamic repetition strategy reduces the number of LLM calls by an average of 81% while preserving accuracy. Furthermore, we propose a confidence-based adaptation of our early-stopping method, reducing LLM calls by an average of 87% compared to static repetition, with only a slight accuracy trade-off relative to our original early-stopping method.
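The core idea above -- repeat the LLM call under different candidate orderings, but stop as soon as the outcome is settled -- can be illustrated with a minimal sketch. This is not the paper's exact algorithm: the function name `dynamic_majority_vote`, the forward/reversed ordering schedule, and the `margin` parameter (a stand-in for the confidence-based variant) are all illustrative assumptions. The guaranteed-majority rule stops once no other ranking could overtake the current leader even if it won every remaining repetition; the optional margin relaxes this rule to stop earlier, trading a little accuracy for fewer calls.

```python
from collections import Counter


def dynamic_majority_vote(rank_fn, candidates, max_reps=9, margin=None):
    """Repeat an LLM ranking call with varied candidate orderings,
    stopping early once the majority outcome is decided.

    rank_fn(ordering) -> a hashable ranking (e.g. a tuple of candidate ids).
    This interface and the stopping rules are illustrative assumptions,
    not the paper's exact method.
    """
    votes = Counter()
    leader = None
    for rep in range(1, max_reps + 1):
        # Alternate forward/reversed orderings to average out position bias.
        ordering = list(candidates) if rep % 2 else list(reversed(candidates))
        votes[tuple(rank_fn(ordering))] += 1

        top_two = votes.most_common(2)
        leader, lead_count = top_two[0]
        runner_up = top_two[1][1] if len(top_two) > 1 else 0
        remaining = max_reps - rep

        # Guaranteed-majority stop: the leader cannot be overtaken.
        stop = lead_count > runner_up + remaining
        # Confidence-based stop (sketch): a fixed vote margin suffices.
        if margin is not None:
            stop = stop or (lead_count - runner_up >= margin)
        if stop:
            return leader, rep
    return leader, max_reps
```

For example, with a perfectly consistent `rank_fn` and `max_reps=9`, the guaranteed-majority rule stops after 5 identical votes (5 > 0 + 4), while `margin=2` stops after only 2, mirroring the abstract's efficiency/accuracy trade-off between the two variants.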