🤖 AI Summary
External inference-time harnesses can improve large language models' performance on complex reasoning tasks, but they leave the model's intrinsic reasoning capabilities unchanged. This work proposes On-Policy Harness Self-Distillation (OPHSD), which uses the harness-augmented current model as the teacher signal during training. Through self-distillation, OPHSD internalizes harness-guided reasoning into the base model itself. The method is evaluated with a draft–verify harness for text classification and a plan–solve harness for mathematical reasoning, and the trained model retains strong performance without relying on an external harness at inference time. Experiments show that OPHSD consistently outperforms strong baselines, for example by +10.83% over OPSD on HMMT25. Reattaching the harness at inference yields no additional benefit and can even degrade performance, which supports the view that harnesses can serve as temporary scaffolds during training rather than permanent dependencies.
📝 Abstract
Inference-time harnesses substantially improve large language models on complex reasoning tasks. However, these external workflows leave the intrinsic capabilities of the underlying model unchanged. To bridge this gap, we introduce \emph{On-Policy Harness Self-Distillation} (OPHSD), which employs the harness-augmented current model as a teacher for self-distillation, thereby introducing supervisory signals from the harness beyond those in the training data. OPHSD internalizes task-specific harness capabilities into the student model, yielding robust generalizability and strong standalone performance across diverse reasoning tasks. Evaluated with a draft--verify harness for text classification and a plan--solve harness for mathematical reasoning, OPHSD consistently outperforms strong baselines (e.g., +10.83\% over OPSD on HMMT25). Our analysis further indicates that reattaching the harness during inference yields no additional benefit and can even degrade performance. This suggests that complex harnesses need not always be permanent fixtures; instead, they can serve as temporary training scaffolds whose benefits are permanently fed back into the base model. Our code and training data are available at https://github.com/zzy1127/OPHSD-On-Policy-Harness-Self-Distillation.
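
The abstract does not state the training objective, so the following is only an illustrative sketch of what a harness-as-teacher distillation signal could look like. It assumes a per-token reverse KL divergence between the student's next-token distribution (model alone) and the teacher's distribution (the same model run with the harness), both scored on a trajectory sampled from the student. The function names, the choice of reverse KL, and the toy logits are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs: Sequence[float],
               teacher_probs: Sequence[float],
               eps: float = 1e-12) -> float:
    """KL(student || teacher) for one token position.

    Assumption: reverse KL is used here only because it is a common choice
    in on-policy distillation; the source does not specify the divergence.
    """
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
    )


def distillation_loss(student_logits: Sequence[Sequence[float]],
                      teacher_logits: Sequence[Sequence[float]]) -> float:
    """Mean per-token divergence over one student-sampled trajectory.

    student_logits[i]: logits of the model alone at token position i.
    teacher_logits[i]: logits of the same model with the harness attached
                       (hypothetical setup), at the same position.
    """
    if not student_logits or len(student_logits) != len(teacher_logits):
        raise ValueError("need equal-length, non-empty logit sequences")
    per_token = [
        reverse_kl(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    # Toy two-token vocabulary, two positions; values are made up.
    student = [[2.0, 0.0], [0.5, 0.5]]
    teacher = [[0.0, 2.0], [0.5, 0.5]]
    print(distillation_loss(student, teacher))
```

Under this sketch, the loss is zero wherever the standalone model already matches its harness-augmented self and grows where the harness changes the prediction, which is one way to read the claim that the harness supplies supervision beyond the training data.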