🤖 AI Summary
Large language models (LLMs) exhibit unreliable mathematical reasoning, often relying on superficial shortcuts rather than sound logical derivation.
Method: This paper proposes Step-KTO, an alignment training framework that jointly incorporates step-level and final-answer binary feedback. It is the first to co-model fine-grained step-level binary signals with outcome feedback within an enhanced KTO (Kahneman–Tversky Optimization) paradigm, integrating chain-of-thought trajectory sampling and dual-granularity reward modeling to enforce logically coherent and verifiable reasoning paths.
Contribution/Results: Evaluated on challenging benchmarks including MATH-500, the method achieves significant gains in Pass@1 accuracy over strong baselines, while simultaneously improving intermediate-step correctness and logical consistency. By moving beyond answer-only optimization, this work establishes a paradigm for enhancing the trustworthiness and interpretability of mathematical reasoning in LLMs.
📝 Abstract
Large language models (LLMs) have recently demonstrated remarkable success in mathematical reasoning. Despite progress in methods like chain-of-thought prompting and self-consistency sampling, these advances often focus on final correctness without ensuring that the underlying reasoning process is coherent and reliable. This paper introduces Step-KTO, a training framework that combines process-level and outcome-level binary feedback to guide LLMs toward more trustworthy reasoning trajectories. By providing binary evaluations for both the intermediate reasoning steps and the final answer, Step-KTO encourages the model to adhere to logical progressions rather than relying on superficial shortcuts. Our experiments on challenging mathematical benchmarks show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps. For example, on the MATH-500 dataset, Step-KTO achieves a notable improvement in Pass@1 accuracy over strong baselines. These results highlight the promise of integrating stepwise process feedback into LLM training, paving the way toward more interpretable and dependable reasoning capabilities.
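To make the dual-granularity idea concrete, here is a minimal sketch of how process-level and outcome-level binary feedback might be blended in a KTO-style objective. This is an illustration only: the function names, the mixing weight `lam`, and the fixed reference point are assumptions, not the paper's actual formulation, which is not specified in the abstract.

```python
import math

def kto_style_loss(log_ratio, desirable, beta=0.1, ref_point=0.0):
    """KTO-inspired loss for one judgment: 1 - sigmoid of the signed,
    scaled implied reward (policy/reference log-ratio) relative to a
    reference point. Desirable items are pushed above the reference,
    undesirable items below it."""
    sign = 1.0 if desirable else -1.0
    z = sign * beta * (log_ratio - ref_point)
    return 1.0 - 1.0 / (1.0 + math.exp(-z))

def step_kto_loss(step_log_ratios, step_labels,
                  outcome_log_ratio, outcome_label, lam=0.5):
    """Hypothetical dual-granularity objective: average the KTO-style
    loss over intermediate steps (process feedback) and blend it with
    the final-answer loss (outcome feedback) via a mixing weight `lam`.
    Both `lam` and the averaging scheme are assumed for illustration."""
    step_loss = sum(
        kto_style_loss(r, ok) for r, ok in zip(step_log_ratios, step_labels)
    ) / len(step_labels)
    outcome_loss = kto_style_loss(outcome_log_ratio, outcome_label)
    return lam * step_loss + (1.0 - lam) * outcome_loss
```

Under this sketch, a trajectory whose steps and final answer are all judged correct incurs a lower loss than one judged incorrect at the same log-ratios, so gradient descent raises the likelihood of coherent reasoning paths rather than only correct final answers.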