TIP: Token Importance in On-Policy Distillation

📅 2026-04-15
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing views of token importance in on-policy distillation (OPD) are incomplete. This work proposes TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over the student model's entropy and the divergence between teacher and student predictions. The taxonomy identifies two informative token regions: high-entropy tokens, and low-entropy tokens with high teacher-student divergence, where the student is overconfident and wrong. Building on this insight, TIP motivates type-aware token selection rules that combine uncertainty and disagreement, covering the region that entropy-only approaches miss. Across MATH-500, AIME 2024/2025, and DeepPlanning, three results stand out: entropy-based sampling that retains 50% of tokens matches or exceeds all-token training while reducing peak memory by up to 47%; training on fewer than 10% of all tokens, drawn from the low-entropy, high-divergence region, nearly matches full-token baselines; and on DeepPlanning, Q3-only training on fewer than 20% of tokens surpasses full-token OPD.

📝 Abstract
On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher--student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining $50\%$ of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to $47\%$. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than $10\%$ of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher--student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher--student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on $<20\%$ of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.
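To make the two-axis view concrete, the following is a minimal NumPy sketch of how per-token student entropy and teacher-student divergence might be computed and combined into a selection mask. It is an illustration only, not the authors' implementation: the function names (`token_scores`, `select_tokens`), the fixed thresholds, and the choice of KL(student || teacher) as the divergence are all assumptions, since the abstract does not specify them.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary (last) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def token_scores(student_logits, teacher_logits):
    """Return (entropy, divergence), each of shape [num_tokens].

    Assumption: divergence is KL(student || teacher); the abstract only
    says "teacher-student divergence" without fixing a direction.
    """
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    s_p = np.exp(s_logp)
    entropy = -(s_p * s_logp).sum(axis=-1)
    divergence = (s_p * (s_logp - t_logp)).sum(axis=-1)
    return entropy, divergence

def select_tokens(entropy, divergence, entropy_threshold, divergence_threshold):
    """Hypothetical selection rule: keep a token if the student is uncertain,
    or if it is confident but disagrees strongly with the teacher."""
    high_entropy = entropy >= entropy_threshold
    overconfident = (~high_entropy) & (divergence >= divergence_threshold)
    return high_entropy | overconfident

# Three token positions over a 4-word vocabulary.
student = np.array([
    [0.0, 0.0, 0.0, 0.0],    # uncertain student
    [10.0, 0.0, 0.0, 0.0],   # confident student, teacher disagrees
    [10.0, 0.0, 0.0, 0.0],   # confident student, teacher agrees
])
teacher = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 10.0, 0.0, 0.0],
    [10.0, 0.0, 0.0, 0.0],
])
entropy, divergence = token_scores(student, teacher)
mask = select_tokens(entropy, divergence,
                     entropy_threshold=1.0, divergence_threshold=1.0)
```

In this toy input, an entropy-only rule would keep just the first position; the second position has near-zero entropy and is retained only because the divergence axis flags it. In practice the thresholds would more plausibly be set per batch (for example by quantile) than fixed, and the resulting mask would gate which positions contribute to the distillation loss.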
Problem

Research questions and friction points this paper is trying to address.

on-policy distillation
token importance
student entropy
teacher-student divergence
knowledge distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy distillation
token importance
student entropy
teacher-student divergence
memory-efficient training