🤖 AI Summary
This work addresses a critical flaw in existing chain-of-thought (CoT) fine-tuning data, where correct final answers are often accompanied by flawed intermediate reasoning—such as hallucinations, redundancies, or logical errors. To mitigate this, the authors propose EntroCoT, a novel framework that introduces entropy-guided dynamic reasoning segmentation and step-level marginal contribution evaluation. Specifically, EntroCoT adaptively partitions reasoning trajectories using information entropy and employs Monte Carlo rollouts to quantify each step’s contribution to the final answer, enabling automatic filtering of low-quality samples. Experiments across multiple mathematical reasoning benchmarks demonstrate that models fine-tuned on high-quality CoT datasets curated by EntroCoT significantly outperform those trained on full original datasets, highlighting the effectiveness of the proposed approach in enhancing reasoning fidelity and model performance.
📝 Abstract
Chain-of-Thought (CoT) prompting has significantly enhanced the mathematical reasoning capabilities of Large Language Models. We find that existing fine-tuning datasets frequently suffer from the "answer right but reasoning wrong" problem, where correct final answers are derived from hallucinated, redundant, or logically invalid intermediate steps. This paper proposes EntroCoT, a unified framework for automatically identifying and refining low-quality CoT supervision traces. EntroCoT first proposes an entropy-based mechanism to segment the reasoning trace into multiple steps at uncertain junctures, and then introduces a Monte Carlo rollout-based mechanism to evaluate the marginal contribution of each step. By accurately filtering deceptive reasoning samples, EntroCoT constructs a high-quality dataset in which every intermediate step of each reasoning trace contributes to the final answer. Extensive experiments on mathematical benchmarks demonstrate that fine-tuning on the subset constructed by EntroCoT consistently outperforms the baseline of full-dataset supervision.
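The two mechanisms described above, entropy-guided segmentation and rollout-based step scoring, can be illustrated with a minimal sketch. This is not the authors' implementation: the threshold value, the `rollout_fn` interface, and all function names are hypothetical, and real usage would derive token entropies from a language model rather than pass them in directly.

```python
import random


def segment_by_entropy(token_entropies, threshold=2.0):
    """Split a reasoning trace into steps at uncertain junctures.

    A step boundary is placed after every token whose predictive entropy
    exceeds the threshold (a hypothetical cutoff, not the paper's value).
    Returns a list of (start, end) token-index pairs.
    """
    boundaries = [i + 1 for i, h in enumerate(token_entropies) if h > threshold]
    starts = [0] + boundaries
    ends = boundaries + [len(token_entropies)]
    return [(s, e) for s, e in zip(starts, ends) if s < e]


def marginal_contribution(steps, rollout_fn, k, n_rollouts=100, seed=0):
    """Estimate step k's marginal contribution via Monte Carlo rollouts.

    rollout_fn(prefix, rng) -> 1 if a continuation sampled from the given
    step prefix reaches the correct final answer, else 0.  The contribution
    is the change in empirical success rate when step k is appended to the
    prefix of preceding steps.
    """
    rng = random.Random(seed)
    with_k = sum(rollout_fn(steps[: k + 1], rng) for _ in range(n_rollouts))
    without_k = sum(rollout_fn(steps[:k], rng) for _ in range(n_rollouts))
    return (with_k - without_k) / n_rollouts
```

A sample with a near-zero (or negative) contribution for some step would be flagged as redundant or misleading and filtered out of the fine-tuning set.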