RoBoN: Routed Online Best-of-n for Test-Time Scaling with Multiple LLMs

📅 2025-12-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional best-of-n methods rely on a single large language model (LLM) for response generation, overlooking the complementary strengths of diverse models across tasks. This paper proposes RoBoN: a training-free, online routing best-of-n framework that incurs no additional computational overhead. RoBoN is the first to introduce model diversity into inference-time dynamic scheduling—sequentially invoking multiple plug-and-play LLMs and selecting the optimal response via joint evaluation of reward-model scoring and cross-model response consistency. Its core innovation lies in achieving multi-model collaborative enhancement without introducing new parameters or requiring fine-tuning. Evaluated on multiple reasoning benchmarks, RoBoN consistently outperforms single-model best-of-n baselines, yielding up to a 3.4 percentage-point absolute accuracy gain, and surpasses uniform ensemble baselines.

📝 Abstract
Best-of-$n$ is a widely used test-time scaling approach for LLM inference. Yet despite evidence that LLMs exhibit complementary strengths across tasks, traditional best-of-$n$ relies on a single model to generate responses. We propose RoBoN (Routed Online Best-of-$n$), a sequential multi-LLM alternative to the prevailing single-model best-of-$n$. Given a suite of models $\{m_i\}_{i=1}^M$, RoBoN sequentially routes generations one-by-one across models, based on scores computed using a reward model and an agreement signal on the predicted responses. This online routing requires no additional training, maintains compute parity, and works with any plug-in reward model. Across reasoning benchmarks (MATH500, OlympiadBench, MinervaMath, GSM8K, MMLU), RoBoN consistently outperforms standard best-of-$n$ applied to each individual model for larger $n$, with gains of up to 3.4% in absolute accuracy, and also improves over a uniform multi-model portfolio baseline. Our results indicate that diversity across models can be exploited at inference to improve best-of-$n$ performance over any constituent model alone, providing a simple, training-free path to test-time scaling with multiple LLMs.
Problem

Research questions and friction points this paper is trying to address.

Standard best-of-$n$ draws all $n$ responses from a single LLM, despite evidence that LLMs have complementary strengths across tasks
Exploiting multiple models at inference time should not require extra training or additional compute
Can a multi-model portfolio outperform best-of-$n$ applied to each constituent model alone?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequentially routes generations one-by-one across a suite of plug-and-play LLMs
Routing combines reward-model scores with a cross-model agreement signal on predicted responses
Training-free and compute-parity, yet gains up to 3.4% absolute accuracy over single-model best-of-n and beats a uniform multi-model ensemble
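The routing idea described above can be sketched in a few lines. The snippet below is a minimal illustrative implementation, not the paper's exact algorithm: the routing rule (running mean reward plus an agreement bonus) and all names (`robon`, `agree_weight`, the dummy models) are assumptions made for the sketch.

```python
from collections import Counter, defaultdict

def robon(models, reward_fn, prompt, n, agree_weight=0.5):
    """Sketch of routed online best-of-n (RoBoN-style).

    models:    dict mapping model name -> generate(prompt) -> response
    reward_fn: (prompt, response) -> float, any plug-in reward model
    NOTE: the routing score below (mean reward + agreement bonus) is an
    illustrative assumption, not the paper's exact formula.
    """
    history = []                      # (model, response, reward)
    reward_sums = defaultdict(float)  # per-model cumulative reward
    counts = defaultdict(int)         # per-model number of generations
    for _ in range(n):
        # Current majority response, used as the agreement target.
        tally = Counter(r for _, r, _ in history).most_common(1)
        majority_resp = tally[0][0] if tally else None

        def route_score(name):
            if counts[name] == 0:
                return float("inf")   # try every model at least once
            mean_r = reward_sums[name] / counts[name]
            agree = sum(1 for m, r, _ in history
                        if m == name and r == majority_resp)
            return mean_r + agree_weight * agree / counts[name]

        # Route the next generation to the highest-scoring model.
        chosen = max(models, key=route_score)
        resp = models[chosen](prompt)
        rew = reward_fn(prompt, resp)
        history.append((chosen, resp, rew))
        reward_sums[chosen] += rew
        counts[chosen] += 1

    # Final selection: highest reward, ties broken by agreement count.
    agreement = Counter(r for _, r, _ in history)
    best = max(history, key=lambda h: (h[2], agreement[h[1]]))
    return best[1]

# Toy usage with two hypothetical models and a toy reward function.
models = {"model_a": lambda p: "4", "model_b": lambda p: "5"}
reward = lambda p, r: 1.0 if r == "4" else 0.2
print(robon(models, reward, "What is 2+2?", n=4))  # → 4
```

Because routing happens online, weaker models stop receiving generation budget after a few low-reward samples, which is how the method keeps compute parity with single-model best-of-$n$.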