OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of reliably selecting among multiple reasoning candidates that a large language model generates at test time, when no ground-truth verifier is available and pointwise LLM judging is noisy and biased. The authors propose OpenDeepThink, a population-based test-time compute framework: in each generation the LLM judges random pairs of candidates, and the votes are aggregated with the Bradley-Terry model into a global ranking. Top-ranked candidates are preserved, the top three quarters are mutated using the natural-language critiques produced during comparison, and the bottom quarter is discarded. The pipeline transfers across weaker and stronger models without retuning. Eight sequential LLM-call rounds (≈27 minutes wall-clock) raise Gemini 3.1 Pro's effective Codeforces Elo by 405 points. On the multi-domain HLE benchmark, gains are concentrated in objectively verifiable domains and reverse in subjective ones. The study also releases CF-73, a curated set of 73 expert-rated Codeforces problems.
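
For readers unfamiliar with it, the standard Bradley-Terry model can be stated as follows. The symbols here are illustrative assumptions and are not taken from the paper: \pi_i > 0 is the latent strength of candidate i, and w_{ij} is the number of comparisons in which candidate i was preferred over candidate j.

```latex
% Probability that candidate i is preferred over candidate j
P(i \succ j) = \frac{\pi_i}{\pi_i + \pi_j}

% Log-likelihood of the observed pairwise votes
\ell(\pi) = \sum_{i \neq j} w_{ij} \, \log \frac{\pi_i}{\pi_i + \pi_j}
```

Maximizing \ell over the strengths and sorting candidates by the fitted \pi_i is one common way to turn noisy pairwise votes into a single global ranking.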

📝 Abstract

Test-time compute scaling is a primary axis for improving LLM reasoning. Existing methods primarily scale depth by extending a single reasoning trace. Scaling breadth by sampling multiple candidates in parallel is straightforward, but introduces a selection bottleneck: choosing the best candidate without a ground-truth verifier, since pointwise LLM judging is noisy and biased. To address this, we introduce OpenDeepThink, a population-based test-time compute framework that selects via pairwise Bradley-Terry comparison. Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques produced during comparison; the bottom quarter is discarded. OpenDeepThink raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points in eight sequential LLM-call rounds (~27 minutes wall-clock). The pipeline transfers across weaker and stronger models without retuning, and on the multi-domain HLE benchmark, gains appear concentrated in objectively verifiable domains and reverse in subjective ones. We release CF-73, a curated set of 73 expert-rated Codeforces problems with International Grandmaster annotation and 99% local-evaluation agreement against the official verdict.
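
The rank-then-select step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names `fit_bradley_terry` and `split_population` are hypothetical. The fit uses the standard minorization-maximization update for Bradley-Terry, the `prior` pseudo-count against a virtual opponent is a regularization choice added here for numerical safety, and the number of preserved elites is a free parameter because the abstract does not state it.

```python
from collections import defaultdict


def fit_bradley_terry(n, comparisons, iters=200, prior=0.1):
    """Estimate Bradley-Terry strengths for n candidates.

    comparisons: list of (winner, loser) index pairs.
    prior: assumed pseudo-count; each candidate gets `prior` virtual wins
           and `prior` virtual losses against a dummy of fixed strength 1,
           so candidates with few or no comparisons stay well defined.
    """
    wins = [0.0] * n
    games = defaultdict(float)  # (i, j) with i < j -> number of comparisons
    for winner, loser in comparisons:
        wins[winner] += 1.0
        games[(min(winner, loser), max(winner, loser))] += 1.0

    p = [1.0] * n
    for _ in range(iters):
        # Contribution of the 2 * prior virtual games against the dummy.
        denom = [2.0 * prior / (p[i] + 1.0) for i in range(n)]
        for (i, j), count in games.items():
            share = count / (p[i] + p[j])
            denom[i] += share
            denom[j] += share
        # Minorization-maximization update: wins over expected exposure.
        p = [(wins[i] + prior) / denom[i] for i in range(n)]
    return p


def split_population(strengths, elite=1):
    """Split candidate indices by fitted strength.

    Returns (elites, to_mutate, discarded): the top `elite` candidates are
    kept unchanged, the top three quarters are mutated, and the bottom
    quarter is dropped. `elite=1` is an assumption, not a value from the paper.
    """
    order = sorted(range(len(strengths)), key=lambda i: strengths[i], reverse=True)
    n_keep = (3 * len(order)) // 4
    return order[:elite], order[:n_keep], order[n_keep:]


if __name__ == "__main__":
    # Toy round robin among four candidates where lower index always wins.
    votes = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    strengths = fit_bradley_terry(4, votes)
    print(split_population(strengths))
```

In a full pipeline, the `votes` list would come from LLM judgments on randomly sampled pairs, and the critiques attached to each judgment would drive the mutation of the `to_mutate` candidates. Both of those steps are outside this sketch.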

Problem

Research questions and friction points this paper is trying to address.

test-time compute

LLM reasoning

candidate selection

noisy judgment

parallel reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time compute scaling

parallel reasoning

Bradley-Terry aggregation