🤖 AI Summary
Large language models (LLMs) exhibit unreliable mathematical reasoning, often relying on superficial shortcuts rather than sound logical derivation.
Method: This paper proposes Step-KTO, an alignment training framework that jointly incorporates step-level and final-answer binary feedback. It is the first to co-model fine-grained step-level binary signals with outcome feedback within an enhanced KTO (Kahneman–Tversky Optimization) paradigm, integrating chain-of-thought trajectory sampling and dual-granularity reward modeling to enforce logically coherent and verifiable reasoning paths.
Contribution/Results: Evaluated on challenging benchmarks including MATH-500, the method achieves significant gains in Pass@1 accuracy over strong baselines, while simultaneously improving intermediate-step correctness and logical consistency. By moving beyond answer-only optimization, this work establishes a paradigm for enhancing the trustworthiness and interpretability of mathematical reasoning in LLMs.
📝 Abstract
Large language models (LLMs) have recently demonstrated remarkable success in mathematical reasoning. Despite progress in methods like chain-of-thought prompting and self-consistency sampling, these advances often focus on final correctness without ensuring that the underlying reasoning process is coherent and reliable. This paper introduces Step-KTO, a training framework that combines process-level and outcome-level binary feedback to guide LLMs toward more trustworthy reasoning trajectories. By providing binary evaluations for both the intermediate reasoning steps and the final answer, Step-KTO encourages the model to adhere to logical progressions rather than relying on superficial shortcuts. Our experiments on challenging mathematical benchmarks show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps. For example, on the MATH-500 dataset, Step-KTO achieves a notable improvement in Pass@1 accuracy over strong baselines. These results highlight the promise of integrating stepwise process feedback into LLM training, paving the way toward more interpretable and dependable reasoning capabilities.
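To make the dual-granularity idea concrete, here is a minimal sketch of how process-level and outcome-level binary feedback might be blended in a KTO-style objective. This is an illustration only: the function names, the mixing weight `lam`, and the fixed reference point are assumptions, not the paper's actual formulation, which is not specified in the abstract.

```python
import math

def kto_style_loss(log_ratio, desirable, beta=0.1, ref_point=0.0):
    """KTO-inspired loss for one judgment: 1 - sigmoid of the signed,
    scaled implied reward (policy/reference log-ratio) relative to a
    reference point. Desirable items are pushed above the reference,
    undesirable items below it."""
    sign = 1.0 if desirable else -1.0
    z = sign * beta * (log_ratio - ref_point)
    return 1.0 - 1.0 / (1.0 + math.exp(-z))

def step_kto_loss(step_log_ratios, step_labels,
                  outcome_log_ratio, outcome_label, lam=0.5):
    """Hypothetical dual-granularity objective: average the KTO-style
    loss over intermediate steps (process feedback) and blend it with
    the final-answer loss (outcome feedback) via a mixing weight `lam`.
    Both `lam` and the averaging scheme are assumed for illustration."""
    step_loss = sum(
        kto_style_loss(r, ok) for r, ok in zip(step_log_ratios, step_labels)
    ) / len(step_labels)
    outcome_loss = kto_style_loss(outcome_log_ratio, outcome_label)
    return lam * step_loss + (1.0 - lam) * outcome_loss
```

Under this sketch, a trajectory whose steps and final answer are all judged correct incurs a lower loss than one judged incorrect at the same log-ratios, so gradient descent raises the likelihood of coherent reasoning paths rather than only correct final answers.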