Value-Guided Search for Efficient Chain-of-Thought Reasoning

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Manually defining reasoning steps in long-context traces impedes efficient value modeling. Method: We propose a token-level value modeling paradigm that eliminates explicit step segmentation, trained on 2.5 million reasoning trajectories. It enables block-level Value-Guided Search (VGS) with a 1.5B-parameter value model (the first such approach independent of fine-grained step boundaries) and introduces a final weighted majority vote that outperforms standard majority voting and best-of-n under identical generation budgets. Contribution/Results: Our method achieves 45.7% average accuracy across four mathematics competition benchmarks (AIME and HMMT, 2024 & 2025), matching o3-mini-medium, while substantially reducing inference FLOPs at a 64-generation budget. Code, datasets, and models are fully open-sourced.

📝 Abstract
In this paper, we propose a simple and efficient method for value model training on long-context reasoning traces. Compared to existing process reward models (PRMs), our method does not require a fine-grained notion of "step," which is difficult to define for long-context reasoning models. By collecting a dataset of 2.5 million reasoning traces, we train a 1.5B token-level value model and apply it to DeepSeek models for improved performance with test-time compute scaling. We find that block-wise value-guided search (VGS) with a final weighted majority vote achieves better test-time scaling than standard methods such as majority voting or best-of-n. With an inference budget of 64 generations, VGS with DeepSeek-R1-Distill-1.5B achieves an average accuracy of 45.7% across four competition math benchmarks (AIME 2024 & 2025, HMMT Feb 2024 & 2025), reaching parity with o3-mini-medium. Moreover, VGS significantly reduces the inference FLOPs required to achieve the same performance as majority voting. Our dataset, model, and codebase are open-sourced.
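The final weighted majority vote mentioned in the abstract can be sketched as follows: each completed trace's answer is weighted by its value score, and the answer with the highest total weight wins. The scores below are illustrative stand-ins, not outputs of the paper's 1.5B value model:

```python
from collections import defaultdict

def weighted_majority_vote(answers, values):
    """Aggregate sampled answers, weighting each vote by its value
    score, and return the answer with the highest total weight."""
    totals = defaultdict(float)
    for ans, val in zip(answers, values):
        totals[ans] += val
    return max(totals, key=totals.get)

# Four sampled generations with hypothetical value scores:
answers = ["42", "41", "42", "40"]
values = [0.9, 0.8, 0.7, 0.3]
print(weighted_majority_vote(answers, values))  # -> "42" (total weight 1.6)
```

Unweighted majority voting counts every generation equally; the weighted variant lets a few high-value traces outvote many low-value ones under the same generation budget.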
Problem

Research questions and friction points this paper is trying to address.

Improving efficiency in long-context chain-of-thought reasoning
Reducing FLOPs for achieving performance parity in reasoning models
Training value models without fine-grained step definitions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-level value model training
Block-wise value-guided search
Weighted majority vote scaling
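The block-wise search listed above can be sketched as a small beam search over fixed-size blocks: at each block boundary the policy proposes several candidate continuations, the value model scores each partial trace, and only the top-scoring beams are kept. The toy value function and block sampler here are hypothetical stand-ins for the paper's actual components:

```python
def toy_value(trace):
    # Hypothetical stand-in for the token-level value model:
    # score a partial trace by the largest block id it contains.
    return max((int(c) for c in trace if c.isdigit()), default=0)

def propose_blocks(prefix, k):
    # Stand-in for sampling k candidate next blocks from the policy LM.
    return [prefix + f"[b{i}]" for i in range(k)]

def value_guided_search(prompt, beams=2, expand=4, num_blocks=3):
    """Keep only the `beams` highest-value partial traces after each block."""
    frontier = [prompt]
    for _ in range(num_blocks):
        candidates = [c for p in frontier for c in propose_blocks(p, expand)]
        candidates.sort(key=toy_value, reverse=True)
        frontier = candidates[:beams]
    return frontier[0]

print(value_guided_search("Q: "))
```

Because pruning happens at fixed block boundaries rather than at model-defined "steps," no step segmentation of the trace is ever required.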