🤖 AI Summary
This work addresses problems that arise when on-policy self-distillation is applied over the full response in reinforcement learning with verifiable rewards: gradients spent on mostly redundant tokens, leakage of privileged information, rising entropy, shortened reasoning, and out-of-distribution degradation in long-horizon math training. To mitigate these problems, the authors propose TRACE, which distills only on annotator-marked critical spans. Forward KL divergence is applied to key spans of correct rollouts, optionally complemented by reverse KL on localized error spans, while all remaining tokens receive standard GRPO updates. The KL term is annealed away after a short warm-up, and together with span masking this keeps cumulative exposure to privileged gradients finite. On four held-out math benchmarks plus GPQA-Diamond, TRACE improves over GRPO by 2.76 percentage points on average, retains a 1.90-point gain under online self-annotation, and preserves the Qwen3-8B base model's out-of-distribution score on GPQA-Diamond.
📝 Abstract
On-policy self-distillation (self-OPD) densifies reinforcement learning with verifiable rewards (RLVR) by letting a policy teach itself under privileged context. We find that when this guidance spans the full response, all-token KL spends gradients on mostly redundant positions and amplifies privileged-information leakage, causing entropy rise, shortened reasoning, and out-of-distribution degradation in long-horizon math training. We propose Token-Routed Alignment for Critical rEasoning (TRACE), which distills only on annotator-marked critical spans: forward KL on key spans of correct rollouts, optional reverse KL on localized error spans, and GRPO on all remaining tokens, with the KL channel annealed away after a short warm-up. Our analysis explains TRACE through two effects: forward KL provides non-vanishing lift to teacher-supported tokens that the student under-allocates, while span masking and decay keep cumulative privileged-gradient exposure finite. On four held-out math benchmarks plus GPQA-Diamond, TRACE improves over GRPO by 2.76 percentage points on average and preserves the Qwen3-8B base OOD score on GPQA-Diamond, where GRPO and all-token self-OPD baselines degrade. Gains persist under online self-annotation (+1.90 percentage points, about 69% of the strong-API gain), reducing the concern that TRACE merely imports external annotator capability. Across scales, the best routed action is base-dependent: on Qwen3-8B it is forward KL on key spans, while on Qwen3-1.7B it shifts to reverse KL on error spans.
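The routing described above can be illustrated with a minimal sketch. This is not the authors' implementation. It makes several assumptions: forward KL is read in the usual distillation sense as KL(teacher || student) and reverse KL as KL(student || teacher); each token gets exactly one loss channel; the per-token GRPO loss is supplied as a precomputed number; and the schedule (hold at a peak, then decay linearly to zero) is only one possible reading of "annealed away after a short warm-up". All function names, route labels, and parameters are hypothetical.

```python
import math

# Hypothetical route labels for each token position.
KEY, ERROR, OTHER = "key", "error", "other"


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def kl_weight(step, warmup_steps, decay_steps, peak=1.0):
    """Assumed schedule: hold at peak during warm-up, then decay linearly to 0."""
    if step < warmup_steps:
        return peak
    progress = (step - warmup_steps) / max(1, decay_steps)
    return peak * max(0.0, 1.0 - progress)


def routed_token_losses(student_logits, teacher_logits, routes, pg_losses, beta):
    """Pick one loss channel per token according to its route label.

    student_logits / teacher_logits: one list of vocabulary logits per token.
    routes: one of KEY, ERROR, OTHER per token.
    pg_losses: precomputed per-token policy-gradient (GRPO-style) losses.
    beta: current KL weight, e.g. from kl_weight().
    """
    losses = []
    for s, t, route, pg in zip(student_logits, teacher_logits, routes, pg_losses):
        p_student, p_teacher = softmax(s), softmax(t)
        if route == KEY:
            # Forward KL: lifts teacher-supported tokens the student under-weights.
            losses.append(beta * kl(p_teacher, p_student))
        elif route == ERROR:
            # Reverse KL on localized error spans (optional in the abstract).
            losses.append(beta * kl(p_student, p_teacher))
        else:
            # All remaining tokens keep the ordinary policy-gradient loss.
            losses.append(pg)
    return losses
```

Under these assumptions, once `kl_weight` reaches zero the key and error tokens contribute no distillation gradient, which is one simple way for the cumulative privileged-gradient exposure to stay bounded.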