Learning from Mistakes: Negative Reasoning Samples Enhance Out-of-Domain Generalization

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 1
Influential: 1
🤖 AI Summary
This work addresses the limitations of conventional supervised fine-tuning (SFT), which relies solely on correct reasoning trajectories and often leads to overfitting and poor out-of-domain generalization. The study systematically demonstrates for the first time that negative samples—those containing incorrect final answers but valid intermediate reasoning steps—can simultaneously mitigate overfitting and enhance the model’s exploratory reasoning capabilities. To leverage such samples effectively, the authors propose GLOW, a sample-aware adaptive loss weighting method. Integrated with chain-of-thought fine-tuning and policy entropy optimization, GLOW improves out-of-domain accuracy by 5.51% over standard SFT using only positive samples on Qwen2.5-7B. Furthermore, when used as an initialization for reinforcement learning, it boosts MMLU performance from 72.82% to 76.47%.

📝 Abstract
Supervised fine-tuning (SFT) on chain-of-thought (CoT) trajectory demonstrations is a common approach for enabling reasoning in large language models. Standard practice typically retains only trajectories with correct final answers (positives) while discarding the rest (negatives). We argue that this paradigm discards substantial supervision and exacerbates overfitting, limiting out-of-domain (OOD) generalization. Specifically, we find, surprisingly, that incorporating negative trajectories into SFT yields substantial OOD generalization gains over positive-only training, as these trajectories often retain valid intermediate reasoning despite incorrect final answers. To understand this effect in depth, we systematically analyze data, training dynamics, and inference behavior, identifying 22 recurring patterns in negative chains that serve a dual role: they moderate loss descent to mitigate overfitting during training and boost policy entropy by 35.67% during inference to facilitate exploration. Motivated by these observations, we further propose Gain-based LOss Weighting (GLOW), an adaptive, sample-aware scheme that exploits these distinctive training dynamics by rescaling per-sample loss based on inter-epoch progress. Empirically, GLOW efficiently leverages unfiltered trajectories, yielding a 5.51% OOD gain over positive-only SFT on Qwen2.5-7B and boosting MMLU from 72.82% to 76.47% as an RL initialization.
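The abstract describes GLOW as rescaling each sample's loss by its inter-epoch progress (its "gain"). The paper's exact weighting rule is not given here, so the following is a minimal sketch under assumptions: the function name `glow_weights`, the choice of a temperature-scaled softmax over per-sample gains, and the normalization to a mean weight of 1 are all hypothetical illustration, not the authors' formula.

```python
import math

def glow_weights(prev_losses, curr_losses, temperature=1.0):
    """Hypothetical gain-based per-sample loss weights.

    prev_losses / curr_losses: per-sample losses from the previous
    and current epoch. The gain (inter-epoch loss drop) serves as a
    proxy for how much each sample is contributing to learning.
    """
    # Gain = inter-epoch progress per sample (larger = faster descent).
    gains = [p - c for p, c in zip(prev_losses, curr_losses)]
    # Softmax over gains gives relative weights; multiplying by the
    # sample count keeps the average weight at 1, so the overall loss
    # scale is unchanged (assumption, for illustration only).
    exps = [math.exp(g / temperature) for g in gains]
    z = sum(exps)
    n = len(gains)
    return [n * e / z for e in exps]

# Example: sample 0 improved most between epochs, so it gets the
# largest weight; the weights average to 1 across the batch.
w = glow_weights([2.0, 1.0, 1.0], [1.0, 0.9, 1.0])
```

In an actual SFT loop these weights would multiply the per-sample CoT losses before reduction; whether GLOW upweights fast or slow learners, and how it treats negative trajectories specifically, is determined by the paper, not by this sketch.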
Problem

Research questions and friction points this paper is trying to address.

out-of-domain generalization
negative reasoning samples
supervised fine-tuning
chain-of-thought
overfitting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Negative Reasoning Samples
Out-of-Domain Generalization
Chain-of-Thought
Loss Weighting
Policy Entropy
Xueyun Tian
Institute of Computing Technology
Multimodal Generation, MLLM
Minghua Ma
Microsoft
AIOps, Cloud Intelligence
Bingbing Xu
Associate professor, Institute of Computing Technology, Chinese Academy of Sciences
Graph Neural Networks, Network Embedding
Nuoyan Lyu
CAS Key Laboratory of AI Safety, Institute of Computing Technology, CAS, Beijing, China
Heng Dong
Tsinghua University, Beijing, China
Zheng Chu
Harbin Institute of Technology
Natural Language Processing
Yuanzhuo Wang
CAS Key Laboratory of AI Safety, Institute of Computing Technology, CAS, Beijing, China
Huawei Shen
CAS Key Laboratory of AI Safety, Institute of Computing Technology, CAS, Beijing, China
Wei Li
Institute of Computing Technology, Chinese Academy of Sciences