🤖 AI Summary
Current large language models (LLMs) perform poorly on deductive logical reasoning tasks, and the joint optimization of test-time scaling with reward modeling remains underexplored. This paper proposes EchoRM: first, it exploits LLMs' "echoic" tendency to reflect erroneous assumptions embedded in prompts, actively inducing and capturing novel reasoning errors and substantially improving error-pattern coverage; second, it generates high-quality positive and negative examples via chain-of-thought prompting to train a dedicated Outcome Reward Model (ORM); third, it integrates the ORM into test-time scaling to enable re-ranking and correction of reasoning paths. Evaluated on three logical reasoning benchmarks (FOLIO, JustLogic, and ProverQA), EchoRM consistently improves performance across four mainstream LLMs, demonstrating its effectiveness, robustness, and cross-model generalizability.
📝 Abstract
Logical reasoning is a critical benchmark for evaluating the capabilities of large language models (LLMs), as it reflects their ability to derive valid conclusions from given premises. While combining test-time scaling with dedicated outcome or process reward models has opened new avenues for enhancing LLM performance on complex reasoning tasks, this space is under-explored in deductive logical reasoning. We present a set of Outcome Reward Models (ORMs) for deductive reasoning. To train the ORMs, we generate data primarily using Chain-of-Thought (CoT) prompting with single and multiple samples. Additionally, we propose a novel tactic to expand the range of error types covered by the ORM training dataset: an echo generation technique that leverages LLMs' tendency to reflect incorrect assumptions made in prompts, yielding additional training data that covers previously unexplored error types. Whereas a standard CoT chain contains only the errors a reasoner is naturally likely to make, the echo strategy deliberately steers the model toward incorrect reasoning. We show that ORMs trained on CoT and echo-augmented data demonstrate improved performance on the FOLIO, JustLogic, and ProverQA datasets across four different LLMs.
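To make the echo idea concrete, the following is a minimal sketch of how such prompts might be constructed: a standard CoT prompt leaves the model free to reason, while an echo prompt asserts an incorrect conclusion up front so the model tends to rationalize it, producing flawed chains that can be labeled as negative ORM training examples. All function and field names here are illustrative assumptions, not the paper's actual implementation.

```python
def build_cot_prompt(premises: str, question: str) -> str:
    """Standard CoT prompt: the model reasons freely, so any errors in the
    resulting chain are ones the reasoner is naturally likely to make."""
    return (
        f"Premises: {premises}\n"
        f"Question: {question}\n"
        "Let's think step by step."
    )


def build_echo_prompt(premises: str, question: str, wrong_answer: str) -> str:
    """Echo prompt: embeds an incorrect answer as an assumption, steering the
    model toward a flawed justification we can harvest as a negative example."""
    return (
        f"Premises: {premises}\n"
        f"Question: {question}\n"
        f"The correct answer is {wrong_answer}. "
        "Explain step by step why this follows from the premises."
    )


def make_orm_example(question: str, chain: str, predicted: str, gold: str) -> dict:
    """Label a (question, reasoning chain) pair for outcome-reward-model
    training: 1 if the chain's final answer matches the gold label, else 0."""
    return {"question": question, "chain": chain, "label": int(predicted == gold)}


if __name__ == "__main__":
    premises = "All birds can fly. Tweety is a bird."
    question = "Can Tweety fly?"
    print(build_cot_prompt(premises, question))
    print(build_echo_prompt(premises, question, "No"))
```

In this sketch, chains sampled from echo prompts would be paired with `label=0` examples, complementing the naturally occurring errors collected from standard CoT sampling.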