Learning to Rank Chain-of-Thought: An Energy-Based Approach with Outcome Supervision

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) frequently produce incorrect answers in mathematical reasoning due to logical inconsistencies in multi-step chain-of-thought (CoT) reasoning; existing mitigation strategies rely heavily on extensive sampling or step-level annotations, entailing high computational cost and poor generalization. This paper proposes EORM, a lightweight posterior verifier that trains an energy-based model (EBM) solely on final-answer labels—using discriminator logits as negative energies—to automatically rank CoT candidates by implicit logical consistency. EORM pioneers the integration of EBMs with result-only supervision, enabling reasoning quality discrimination without step-level annotations. Evaluated on GSM8K and MATH, Llama-3-8B augmented with EORM achieves 90.7% and 63.7% accuracy, respectively—matching or surpassing brute-force sampling—while significantly improving both reliability and inference efficiency.
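At inference time, the core mechanic the summary describes is simple: read the verifier's discriminator logit for each CoT candidate as a negative energy, then pick the candidate with the lowest energy (equivalently, the highest logit). The sketch below illustrates that selection step; the function and variable names are illustrative, not from the paper.

```python
def rank_by_energy(candidates, energies):
    """Rank CoT candidates by energy; lower energy = preferred.

    Under EORM's convention, energy = -logit, so the minimum-energy
    candidate is exactly the maximum-logit one.
    """
    order = sorted(range(len(candidates)), key=lambda i: energies[i])
    return [candidates[i] for i in order]


def select_answer(candidates, logits):
    """Pick the candidate whose discriminator logit is highest
    (i.e., whose energy -logit is lowest)."""
    energies = [-l for l in logits]
    return rank_by_energy(candidates, energies)[0]
```

This replaces brute-force strategies such as majority voting over many samples: the verifier scores a fixed pool of candidates once and the argmin-energy candidate is returned.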

📝 Abstract
Mathematical reasoning presents a significant challenge for Large Language Models (LLMs), often requiring robust multi-step logical consistency. While Chain-of-Thought (CoT) prompting elicits reasoning steps, it does not guarantee correctness, and improving reliability via extensive sampling is computationally costly. This paper introduces the Energy Outcome Reward Model (EORM), an effective, lightweight, post-hoc verifier. EORM leverages Energy-Based Models (EBMs) to simplify the training of reward models by learning to assign a scalar energy score to CoT solutions using only outcome labels, thereby avoiding detailed annotations. It achieves this by interpreting discriminator output logits as negative energies, effectively ranking candidates such that lower energy is assigned to solutions leading to correct final outcomes, implicitly favoring coherent reasoning. On mathematical benchmarks (GSM8K, MATH), EORM significantly improves final-answer accuracy (e.g., with Llama-3-8B, achieving 90.7% on GSM8K and 63.7% on MATH). EORM effectively leverages a given pool of candidate solutions to match or exceed the performance of brute-force sampling, thereby enhancing the reliability of LLM reasoning outcomes through its streamlined post-hoc verification process.
Problem

Research questions and friction points this paper is trying to address.

Improving reliability of LLM mathematical reasoning with CoT
Reducing computational cost of verifying CoT correctness
Enhancing accuracy of final answers in math benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Energy Outcome Reward Model (EORM) for lightweight post-hoc verification
Uses Energy-Based Models (EBMs) to assign scalar energy scores to CoT solutions
Improves final-answer accuracy using outcome labels only, without step-level annotations
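The outcome-only supervision can be pictured with a small sketch. The paper does not spell out its exact objective here, so the loss below is a hypothetical softmax-style ranking loss, written under the assumption that candidates sharing a prompt are scored jointly and that candidates whose final answer matches the label should receive lower energy.

```python
import math


def outcome_ranking_loss(energies, correct_mask):
    """Hypothetical outcome-supervised loss (a sketch, not the paper's
    exact objective): treat -energy as a logit and minimize the negative
    log-probability that a correct-outcome candidate is selected.

    energies: one scalar energy per CoT candidate for the same problem.
    correct_mask: True where the candidate's final answer is correct.
    """
    logits = [-e for e in energies]
    m = max(logits)                       # stabilize the softmax
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    p_correct = sum(x for x, ok in zip(exps, correct_mask) if ok) / z
    return -math.log(p_correct)
```

Minimizing this loss pushes energies of correct-outcome candidates below those of incorrect ones, which is all that is needed for the argmin-energy selection rule at inference; no step-level labels enter the computation.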