🤖 AI Summary
This work addresses the challenges of efficiency and catastrophic forgetting in post-training large language models for enhanced reasoning. It proposes Surgical Post-Training (SPoT), a method that leverages an oracle to minimally edit erroneous reasoning steps, generating high-quality correction data aligned with the model's distribution. Reasoning correctness is framed as a binary classification problem and optimized via a reward-based binary cross-entropy objective. SPoT reveals an implicit regularization mechanism in DPO reward estimation, enabling effective training with only a small number of precise correction samples while circumventing the reliance on relative preference rankings inherent in conventional preference optimization. Experiments demonstrate that fine-tuning Qwen3-8B with just 4k math correction examples for 28 minutes on 8×H800 GPUs yields a 6.2% average accuracy improvement across both in-domain and out-of-domain tasks.
📄 Abstract
Enhancing the reasoning capabilities of Large Language Models (LLMs) via post-training is often constrained by the trade-off between efficiency and catastrophic forgetting. While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover, and validate both theoretically and empirically, an overlooked yet critical mechanism: the implicit regularization inherent in the reward estimate of Direct Preference Optimization (DPO). This motivates Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving prior knowledge. SPoT consists of: (1) a data rectification pipeline that employs an Oracle to surgically correct erroneous steps via minimal edits, generating data proximal to the model's distribution; and (2) a reward-based binary cross-entropy objective. Unlike the relative ranking in DPO, this objective treats reasoning correctness as a binary classification problem, enforcing decoupled supervision signals. Empirically, with only 4k rectified math data pairs, SPoT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and OOD tasks, requiring merely 28 minutes of training on 8x H800 GPUs. Code: https://github.com/Visual-AI/SPoT
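As a rough illustration of the objective described above, the sketch below combines a DPO-style implicit reward (a scaled log-probability ratio between policy and reference model) with a per-sample binary cross-entropy loss, where each trace is independently labeled correct (1) or incorrect (0) rather than ranked pairwise. This is a minimal sketch based only on the abstract; the function names, the `beta` scaling, and the exact form of SPoT's loss are assumptions, not the authors' implementation.

```python
import math

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    # DPO-style implicit reward: scaled log-ratio of the policy's
    # sequence log-probability to the frozen reference model's.
    # `beta` is a hypothetical scaling hyperparameter.
    return beta * (logp_policy - logp_ref)

def spot_bce_loss(samples, beta=0.1):
    # samples: list of (logp_policy, logp_ref, label) triples,
    # label = 1 for a correct reasoning trace, 0 for an erroneous one.
    # Unlike DPO's pairwise ranking, each sample is supervised
    # independently (decoupled supervision signals).
    total = 0.0
    for logp_pol, logp_ref, label in samples:
        r = implicit_reward(logp_pol, logp_ref, beta)
        p = 1.0 / (1.0 + math.exp(-r))  # sigmoid: P(trace is correct)
        total += -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
    return total / len(samples)
```

Under this framing, the loss falls when the policy assigns higher likelihood than the reference to corrected traces and lower likelihood to erroneous ones, while the log-ratio anchoring to the reference model is what would supply the implicit regularization against forgetting.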