🤖 AI Summary
Existing generative reward models for reasoning tasks rely on either pointwise scoring or pairwise comparison: the former underutilizes large language models' (LLMs) strong comparative reasoning ability, while the latter scales poorly at high sampling budgets. Method: GenSelect embeds a generative selection mechanism into the Best-of-N paradigm, prompting an LLM (e.g., QwQ or DeepSeek-R1-0528) with chain-of-thought instructions to reason over all N parallel candidate outputs and directly select the best one. Contribution/Results: GenSelect exploits LLMs' strength at semantic comparison while scaling efficiently at test time. On mathematical reasoning benchmarks, GenSelect significantly outperforms conventional pointwise and pairwise reward modeling using only lightweight prompts, demonstrating its effectiveness, simplicity, and scalability.
📝 Abstract
Generative reward models with parallel sampling have enabled effective test-time scaling for reasoning tasks. Current approaches employ pointwise scoring of individual solutions or pairwise comparisons. However, pointwise methods underutilize LLMs' comparative abilities, while pairwise methods scale inefficiently with larger sampling budgets. We introduce GenSelect, where the LLM uses long reasoning to select the best solution among N candidates. This leverages LLMs' comparative strengths while scaling efficiently across parallel sampling budgets. For math reasoning, we demonstrate that reasoning models, such as QwQ and DeepSeek-R1-0528, excel at GenSelect, outperforming existing scoring approaches with simple prompting.
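The selection loop described above can be sketched in a few lines: format all N candidates into one prompt, ask the model to reason and name the best solution, then parse out the chosen index. This is a minimal illustration, not the paper's actual prompt or code; the prompt wording, the `query_llm` callable, and the `Best solution:` answer format are assumptions made for the sketch.

```python
import re

def build_genselect_prompt(problem, candidates):
    """Assemble one prompt asking the model to compare all N
    candidate solutions and pick the best (GenSelect-style).
    The exact wording here is illustrative, not the paper's prompt."""
    numbered = "\n\n".join(
        f"Solution {i}:\n{sol}" for i, sol in enumerate(candidates)
    )
    return (
        f"Problem:\n{problem}\n\n"
        f"Below are {len(candidates)} candidate solutions.\n\n"
        f"{numbered}\n\n"
        "Reason step by step about which is correct, then end your "
        "answer with 'Best solution: <index>'."
    )

def parse_selection(response, n):
    """Extract the chosen index from the model's final answer line;
    fall back to candidate 0 if no valid index is found."""
    m = re.search(r"Best solution:\s*(\d+)", response)
    if m is None:
        return 0
    idx = int(m.group(1))
    return idx if 0 <= idx < n else 0

def best_of_n(problem, candidates, query_llm):
    """query_llm is any callable prompt -> text (e.g., an API wrapper
    around a reasoning model such as QwQ)."""
    prompt = build_genselect_prompt(problem, candidates)
    choice = parse_selection(query_llm(prompt), len(candidates))
    return candidates[choice]

# Toy run with a stubbed "model" that always names solution 1.
picked = best_of_n(
    "Compute 2 + 2.",
    ["2 + 2 = 5.", "2 + 2 = 4."],
    lambda prompt: "...long reasoning... Best solution: 1",
)
print(picked)  # → 2 + 2 = 4.
```

Note the contrast with pointwise or pairwise reward modeling: a single selection call covers all N candidates, so the number of judge calls stays constant as the sampling budget grows, rather than growing linearly (pointwise) or quadratically (all-pairs comparison).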