Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes an unsupervised test-time optimization method for improving LLM generation quality over discrete output spaces when reliable scalar reward signals are unavailable. The approach uses the LLM's own pairwise preferences over candidate outputs as an intrinsic optimization signal, estimating candidate quality with a Bayesian Bradley-Terry model. Double Thompson Sampling then allocates the comparison budget efficiently and selects high-quality parents, enabling unsupervised evolutionary optimization. Experiments show accuracy gains of 20 percentage points on MathBench and improvements of more than 12 percentage points over existing iterative approaches on LiveCodeBench, eliminating the need for external reward models, ground-truth labels, or human annotations.
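The summary's core estimation step, aggregating noisy pairwise wins into per-candidate quality estimates with uncertainty, can be sketched as follows. The paper does not publish implementation details, so this is a minimal stand-in: a MAP fit of Bradley-Terry latent qualities under Gaussian priors via gradient ascent, with a diagonal Laplace approximation supplying the per-candidate variances. The function name and all hyperparameters (`prior_var`, `lr`, `steps`) are illustrative assumptions, not the authors' code.

```python
import math


def fit_bradley_terry(n, comparisons, prior_var=1.0, lr=0.1, steps=500):
    """MAP fit of Bradley-Terry latent qualities from duel outcomes.

    comparisons: list of (winner, loser) index pairs over n candidates.
    Returns (theta, var): quality estimates and approximate posterior
    variances from a diagonal Laplace approximation. This is a sketch
    of one plausible "Bayesian Bradley-Terry" estimator, not the
    paper's exact model.
    """
    theta = [0.0] * n
    for _ in range(steps):
        # Gradient of the log-posterior: Gaussian prior term ...
        grad = [-t / prior_var for t in theta]
        # ... plus the Bradley-Terry (logistic) likelihood term.
        for w, l in comparisons:
            p = 1.0 / (1.0 + math.exp(theta[l] - theta[w]))  # P(w beats l)
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        theta = [t + lr * g for t, g in zip(theta, grad)]
    # Diagonal Laplace approximation: variance = 1 / (negative Hessian diag).
    var = []
    for i in range(n):
        h = 1.0 / prior_var
        for w, l in comparisons:
            if i in (w, l):
                p = 1.0 / (1.0 + math.exp(theta[l] - theta[w]))
                h += p * (1.0 - p)
        var.append(1.0 / h)
    return theta, var
```

Candidates that win most of their duels end up with higher `theta`, and candidates involved in more comparisons get tighter variances, which is exactly the uncertainty signal a budget-allocation scheme can exploit.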

📝 Abstract
Many applications seek to optimize LLM outputs at test time by iteratively proposing, scoring, and refining candidates over a discrete output space. Existing methods use a calibrated scalar evaluator for the target objective to guide search, but for many tasks such scores are unavailable, too sparse, or unreliable. Pairwise comparisons, by contrast, are often easier to elicit, still provide useful signal on improvement directions, and can be obtained from the LLM itself without external supervision. Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates. Duel-Evolve aggregates these noisy candidate comparisons via a Bayesian Bradley-Terry model, yielding uncertainty-aware estimates of candidate quality. These quality estimates guide both the allocation of the comparison budget toward plausible optima via Double Thompson Sampling and the selection of high-quality parents for generating improved candidates. We evaluate Duel-Evolve on MathBench, where it achieves 20 percentage points higher accuracy than existing methods and baselines, and on LiveCodeBench, where it improves on comparable iterative methods by more than 12 percentage points. Notably, the method requires no reward model, no ground-truth labels during search, and no hand-crafted scoring function. These results show that pairwise self-preferences provide a strong optimization signal for test-time improvement over large, discrete output spaces.
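The abstract's budget-allocation step, using quality estimates to decide which two candidates duel next, can be sketched as below. This is a deliberate simplification: the original Double Thompson Sampling algorithm samples from Beta posteriors over pairwise win probabilities, whereas this sketch Thompson-samples twice from per-candidate Gaussian quality posteriors (the `mu`/`var` parameterization and Gaussian form are assumptions, not the paper's specification).

```python
import random


def double_thompson_select(mu, var, rng=random):
    """Pick two distinct candidates to duel next by Thompson sampling.

    mu, var: per-candidate posterior means and variances of quality
    (e.g. from a Bayesian Bradley-Terry fit). Sampling twice, once per
    duel slot, focuses comparisons on plausible optima while the
    posterior variance keeps uncertain candidates in contention. A
    simplified stand-in for Double Thompson Sampling, not the paper's
    exact procedure.
    """
    n = len(mu)
    # First duelist: best candidate under one posterior sample.
    sample1 = [rng.gauss(m, v ** 0.5) for m, v in zip(mu, var)]
    first = max(range(n), key=lambda i: sample1[i])
    # Second duelist: redraw and pick the best candidate other than `first`.
    sample2 = [rng.gauss(m, v ** 0.5) for m, v in zip(mu, var)]
    second = max((i for i in range(n) if i != first), key=lambda i: sample2[i])
    return first, second
```

In an evolutionary loop of the kind the abstract describes, the same sampled qualities could also rank candidates for parent selection, so that both the comparison budget and the next generation concentrate on the current plausible optima.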
Problem

Research questions and friction points this paper is trying to address.

reward-free optimization
test-time scaling
pairwise preferences
LLM self-evaluation
discrete output space
Innovation

Methods, ideas, or system contributions that make the work stand out.

reward-free optimization
pairwise preferences
test-time scaling
evolutionary algorithm
LLM self-evaluation