Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates a class of high-loss yet performance-irrelevant tokens, dubbed "Rock Tokens", that persist in On-Policy Distillation (OPD) even after training has apparently converged. Using per-token KL divergence analysis, gradient norm evaluation, and causal intervention experiments, the work identifies and characterizes these tokens. It finds that they account for a disproportionately large share of the gradient norm yet remain poorly absorbed by the student model, so they act as an optimization bottleneck rather than a performance driver. Building on this insight, the paper argues for non-uniform token weighting in distillation: selectively bypassing Rock Tokens, which can make up as much as 18% of generated output, can make distillation significantly more efficient without compromising the student model's reasoning capabilities.

📝 Abstract

While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionately drives reasoning gains, an analogous token-level understanding of On-Policy Distillation (OPD) remains largely unexplored. In this work, we investigate high-loss tokens. As the most direct signal of student-teacher mismatch under OPD's per-token KL objective, these tokens should, according to existing studies, progressively diminish as training converges; however, our empirical analysis shows otherwise. Even after OPD training reaches apparent saturation, a substantial subset of tokens continues to exhibit persistently high loss; these tokens, which we term Rock Tokens, can account for up to 18% of the tokens in generated outputs. Our investigation reveals two startling paradoxes. First, although their high frequency means they contribute a disproportionately large share of the total gradient norm, Rock Tokens themselves remain stagnant throughout training, resisting teacher-driven corrections. Second, through causal intervention, we find that these tokens contribute negligibly to the model's actual reasoning performance. These findings suggest that a vast amount of optimization bandwidth is spent on structural and discourse residuals that the student model cannot or need not internalize. By deconstructing these dynamics, we demonstrate that strategically bypassing these "stumbling blocks" can significantly streamline the alignment process, challenging the necessity of uniform token weighting and offering a more efficient paradigm for large-scale model distillation.
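The idea of flagging persistently high-loss tokens and excluding them from the distillation loss can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the KL direction (student to teacher), the function names, the threshold of 0.5, and the rule that a token counts as a "rock" when its loss is above the threshold both early in training and after saturation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def per_token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at each position.

    The direction of the KL is an assumption for this sketch.
    Both inputs are lists of per-position logit lists.
    """
    kls = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        log_p = log_softmax(s_row)
        log_q = log_softmax(t_row)
        kls.append(sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q)))
    return kls


def rock_mask(kl_early, kl_late, threshold=0.5):
    """Hypothetical criterion: loss stays above `threshold` at both checkpoints."""
    return [e > threshold and l > threshold for e, l in zip(kl_early, kl_late)]


def weighted_loss(kl, mask):
    """Mean per-token KL with zero weight on flagged tokens."""
    kept = [k for k, is_rock in zip(kl, mask) if not is_rock]
    return sum(kept) / len(kept) if kept else 0.0


# Toy per-token losses at an early and a late checkpoint (made-up numbers).
kl_early = [2.0, 1.5, 0.9, 0.1]
kl_late = [0.1, 1.4, 0.2, 0.05]
mask = rock_mask(kl_early, kl_late)
print(mask)
print(weighted_loss(kl_late, mask))
```

In this toy example only the second token stays above the threshold at both checkpoints, so it alone is dropped from the average. A real implementation would operate on model logits over many sampled sequences and would need a principled way to choose the threshold and the checkpoints.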

Problem

Research questions and friction points this paper is trying to address.

On-Policy Distillation

Rock Tokens

high-loss tokens

student-teacher mismatch

token-level analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Rock Tokens

On-Policy Distillation

token-level analysis