🤖 AI Summary
This study investigates whether language models can improve their reasoning capability by training on chain-of-thought (CoT) traces whose final answers are entirely incorrect.
Method: The authors propose a “wrong-but-distributionally-close” synthetic CoT training paradigm. They train Qwen, Llama, and Gemma models (1.5B–9B) on synthetic CoT data distilled from more capable models, in which the final answers are incorrect but many intermediate steps remain valid. The approach combines CoT distillation, self-paraphrased distribution alignment, and progressive error injection.
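The core of this paradigm can be illustrated with a minimal sketch of the data-construction loop. The helper names below (`teacher_generate`, `extract_final_answer`) and the `problem.question` / `problem.answer` fields are hypothetical placeholders, not the paper's actual API; the point is only that teacher traces are filtered for *incorrect* final answers before fine-tuning.

```python
def build_wrong_but_close_dataset(problems, teacher_generate, extract_final_answer):
    """Collect teacher CoT traces whose final answers are incorrect.

    Such traces are kept because their intermediate steps are often
    largely valid and their style matches model-generated text, which
    is the distributional property the paper argues matters most.
    """
    dataset = []
    for problem in problems:
        trace = teacher_generate(problem.question)          # full CoT from a stronger model
        if extract_final_answer(trace) != problem.answer:   # keep only *incorrect* answers
            dataset.append({"prompt": problem.question, "completion": trace})
    return dataset
```

The resulting prompt/completion pairs would then feed a standard supervised fine-tuning stage; the sketch deliberately leaves that stage out.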
Contribution/Results: Key findings show that the distributional fidelity of reasoning traces matters more than answer correctness: models learn robustly from stepwise reasoning despite answer-level errors. Evaluated on MATH, GSM8K, Countdown, and MBPP, the method can outperform training on human-annotated CoT data. These results position distributional alignment, rather than answer accuracy, as the central principle for constructing high-quality reasoning data.
📝 Abstract
We present the surprising finding that a language model's reasoning capabilities can be improved by training on synthetic datasets of chain-of-thought (CoT) traces from more capable models, even when all of those traces lead to an incorrect final answer. Our experiments show this approach can yield better performance on reasoning tasks than training on human-annotated datasets. We hypothesize that two key factors explain this phenomenon: first, the distribution of synthetic data is inherently closer to the language model's own distribution, making it more amenable to learning; second, these "incorrect" traces are often only partially flawed and contain valid reasoning steps from which the model can learn. To further test the first hypothesis, we use a language model to paraphrase human-annotated traces, shifting their distribution closer to the model's own, and show that this improves performance. For the second hypothesis, we introduce increasingly flawed CoT traces and study to what extent models are tolerant of these flaws. We demonstrate our findings across reasoning domains including math, algorithmic reasoning, and code generation, using the MATH, GSM8K, Countdown, and MBPP datasets, on language models ranging from 1.5B to 9B parameters across the Qwen, Llama, and Gemma families. Our study shows that curating datasets close to the model's own distribution is a critical consideration, and that a correct final answer is not always a reliable indicator of a faithful reasoning process.
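The two hypothesis tests in the abstract lend themselves to a compact illustration. Below is a minimal sketch, assuming hypothetical `model_paraphrase` and `corrupt_step` callables; the paper's actual paraphrasing prompt and perturbation procedure may differ.

```python
import random

def paraphrase_trace(model_paraphrase, human_trace):
    """Hypothesis 1 probe: rewrite a human-annotated trace in the
    model's own words, shifting it toward the model's distribution
    while keeping the reasoning content fixed. `model_paraphrase`
    is a hypothetical call into the model being trained."""
    prompt = f"Rewrite the following reasoning in your own words:\n{human_trace}"
    return model_paraphrase(prompt)

def inject_errors(trace_steps, flaw_rate, corrupt_step):
    """Hypothesis 2 probe: corrupt a controlled fraction of reasoning
    steps so flaw tolerance can be measured as `flaw_rate` grows.
    `corrupt_step` is a hypothetical perturbation, e.g. substituting
    a wrong intermediate arithmetic result."""
    return [corrupt_step(step) if random.random() < flaw_rate else step
            for step in trace_steps]
```

Sweeping `flaw_rate` from 0 toward 1 and retraining at each level is one way to chart the tolerance curve the authors describe for increasingly flawed traces.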