🤖 AI Summary
Current large language models (LLMs) perform poorly on deductive logical reasoning tasks, and the joint optimization of test-time scaling with reward modeling remains underexplored. This paper proposes EchoRM: first, it exploits LLMs' "echoic" tendency to reflect erroneous assumptions embedded in prompts, actively inducing and capturing novel reasoning errors and substantially improving error-pattern coverage; second, it generates high-quality positive and negative examples via chain-of-thought prompting to train a dedicated Outcome Reward Model (ORM); third, it integrates the ORM into test-time scaling to enable re-ranking and correction of reasoning paths. Evaluated on three logical reasoning benchmarks (FOLIO, JustLogic, and ProverQA), EchoRM consistently improves performance across four mainstream LLMs, demonstrating its effectiveness, robustness, and cross-model generalizability.
📝 Abstract
Logical reasoning is a critical benchmark for evaluating the capabilities of large language models (LLMs), as it reflects their ability to derive valid conclusions from given premises. While combining test-time scaling with dedicated outcome or process reward models has opened new avenues for enhancing LLM performance on complex reasoning tasks, this space is under-explored in deductive logical reasoning. We present a set of Outcome Reward Models (ORMs) for deductive reasoning. To train the ORMs, we generate data primarily using Chain-of-Thought (CoT) prompting with single and multiple samples. Additionally, we propose a novel tactic to expand the range of error types covered by the ORM training dataset: an echo generation technique that leverages LLMs' tendency to reflect incorrect assumptions made in prompts, yielding additional training data that covers previously unexplored error types. Whereas a standard CoT chain contains only the errors a reasoner is naturally likely to make, the echo strategy deliberately steers the model toward incorrect reasoning. We show that ORMs trained on CoT and echo-augmented data demonstrate improved performance on the FOLIO, JustLogic, and ProverQA datasets across four different LLMs.
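To make the echo idea concrete, the following is a minimal sketch of how such prompts might be constructed: a standard CoT prompt leaves the model free to reason, while an echo prompt asserts an incorrect conclusion up front so the model tends to rationalize it, producing flawed chains that can be labeled as negative ORM training examples. All function and field names here are illustrative assumptions, not the paper's actual implementation.

```python
def build_cot_prompt(premises: str, question: str) -> str:
    """Standard CoT prompt: the model reasons freely, so any errors in the
    resulting chain are ones the reasoner is naturally likely to make."""
    return (
        f"Premises: {premises}\n"
        f"Question: {question}\n"
        "Let's think step by step."
    )


def build_echo_prompt(premises: str, question: str, wrong_answer: str) -> str:
    """Echo prompt: embeds an incorrect answer as an assumption, steering the
    model toward a flawed justification we can harvest as a negative example."""
    return (
        f"Premises: {premises}\n"
        f"Question: {question}\n"
        f"The correct answer is {wrong_answer}. "
        "Explain step by step why this follows from the premises."
    )


def make_orm_example(question: str, chain: str, predicted: str, gold: str) -> dict:
    """Label a (question, reasoning chain) pair for outcome-reward-model
    training: 1 if the chain's final answer matches the gold label, else 0."""
    return {"question": question, "chain": chain, "label": int(predicted == gold)}


if __name__ == "__main__":
    premises = "All birds can fly. Tweety is a bird."
    question = "Can Tweety fly?"
    print(build_cot_prompt(premises, question))
    print(build_echo_prompt(premises, question, "No"))
```

In this sketch, chains sampled from echo prompts would be paired with `label=0` examples, complementing the naturally occurring errors collected from standard CoT sampling.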