🤖 AI Summary
Large language models (LLMs) frequently produce incorrect answers on mathematical reasoning tasks due to logical inconsistencies in multi-step chain-of-thought (CoT) derivations; existing mitigation strategies rely heavily on extensive sampling or step-level annotations, entailing high computational cost and poor generalization. This paper proposes EORM, a lightweight post-hoc verifier that trains an energy-based model (EBM) solely on final-answer labels, interpreting discriminator logits as negative energies, to rank CoT candidates by implicit logical consistency. EORM pioneers the integration of EBMs with result-only supervision, enabling discrimination of reasoning quality without step-level annotations. Evaluated on GSM8K and MATH, Llama-3-8B augmented with EORM achieves 90.7% and 63.7% accuracy, respectively, matching or surpassing brute-force sampling while significantly improving both reliability and inference efficiency.
📝 Abstract
Mathematical reasoning presents a significant challenge for Large Language Models (LLMs), often requiring robust multi-step logical consistency. While Chain-of-Thought (CoT) prompting elicits reasoning steps, it does not guarantee correctness, and improving reliability via extensive sampling is computationally costly. This paper introduces the Energy Outcome Reward Model (EORM), an effective, lightweight, post-hoc verifier. EORM leverages Energy-Based Models (EBMs) to simplify reward-model training: it learns to assign a scalar energy score to CoT solutions using only outcome labels, thereby avoiding detailed step-level annotations. It achieves this by interpreting discriminator output logits as negative energies, ranking candidates so that lower energy is assigned to solutions leading to correct final outcomes, implicitly favoring coherent reasoning. On mathematical benchmarks (GSM8K, MATH), EORM significantly improves final-answer accuracy (e.g., 90.7% on GSM8K and 63.7% on MATH with Llama-3-8B). By reranking a given pool of candidate solutions, EORM matches or exceeds the performance of brute-force sampling, enhancing the reliability of LLM reasoning outcomes through its streamlined post-hoc verification process.
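The reranking step described above reduces to an argmin over candidate energies. The sketch below illustrates that selection logic only; `toy_energy` is a hypothetical stand-in for the paper's trained verifier (which would score a CoT with a learned discriminator whose logit is read as a negative energy), not an implementation of EORM itself.

```python
from typing import Callable, List

def select_best_cot(candidates: List[str],
                    energy_fn: Callable[[str], float]) -> str:
    """EORM-style reranking: return the candidate chain-of-thought
    with the lowest energy (lower energy ~ more consistent reasoning)."""
    return min(candidates, key=energy_fn)

def toy_energy(solution: str) -> float:
    # Hypothetical scoring stub: in EORM the "logit" would come from a
    # trained discriminator over the CoT text; energy E = -logit.
    logit = float(len(solution) % 7)
    return -logit

candidates = [
    "2+2=4",
    "2 + 2 = 5 maybe",
    "two plus two equals four",
]
best = select_best_cot(candidates, toy_energy)
```

In practice the candidate pool would come from sampling multiple CoT completions from the base LLM, with the verifier replacing majority voting or further brute-force sampling.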