Training-Trajectory-Aware Token Selection

📅 2026-01-15
📈 Citations: 1
Influential: 0
🤖 AI Summary
In continual distillation scenarios where the student model already exhibits strong reasoning capabilities, performance often plateaus or even degrades at a bottleneck during training. This work identifies, for the first time, that this issue stems from an adversarial relationship between Imitation-Anchor Tokens and yet-to-learn tokens, caused by token-level confidence divergence. To address this, we propose a training-trajectory-aware dynamic token selection mechanism that restructures the distillation objective to unblock the optimization path. Combined with sample-efficient fine-tuning, our approach achieves substantial performance gains with only hundreds of examples: Qwen3-8B surpasses DeepSeek-R1, Qwen3-32B approaches Qwen3-235B, and LLaDA-2.0-Mini attains state-of-the-art results among 16B-scale non-thinking models.

📝 Abstract
Efficient distillation is a key pathway for converting expensive reasoning capability into deployable efficiency, yet in the frontier regime where the student already has strong reasoning ability, naive continual distillation often yields limited gains or even degradation. We observe a characteristic training phenomenon: even as the loss decreases monotonically, all performance metrics can drop sharply at almost the same bottleneck point before gradually recovering. We further uncover a token-level mechanism: confidence bifurcates into steadily increasing Imitation-Anchor Tokens, which quickly anchor optimization, and other yet-to-learn tokens, whose confidence is suppressed until after the bottleneck. The inability of these two types of tokens to improve together is the root cause of failure in continual distillation. To address this, we propose Training-Trajectory-Aware Token Selection (T3S), which reconstructs the training objective at the token level, clearing the optimization path for yet-to-learn tokens. T3S yields consistent gains in both AR and dLLM settings: with only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1 on competitive reasoning benchmarks, Qwen3-32B approaches Qwen3-235B, and T3S-trained LLaDA-2.0-Mini exceeds its AR baseline, achieving state-of-the-art performance among 16B-scale non-thinking models.
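The abstract does not give the exact T3S objective, but the core idea, down-weighting tokens whose confidence trajectory marks them as already-anchored so the gradient budget shifts to yet-to-learn tokens, can be sketched as a masked token-level cross-entropy. Everything here (the function name, the running-confidence input, and the `anchor_threshold` parameter) is a hypothetical illustration, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def trajectory_aware_distill_loss(logits, targets, conf_history, anchor_threshold=0.9):
    """Hedged sketch of trajectory-aware token selection (not the paper's exact T3S).

    logits:       (seq_len, vocab) student logits for one sequence
    targets:      (seq_len,) teacher/target token ids
    conf_history: (seq_len,) running mean of each token's target-probability
                  over recent training steps (tracked outside this function)
    """
    # Standard per-token distillation loss, kept unreduced.
    per_token_ce = F.cross_entropy(logits, targets, reduction="none")  # (seq_len,)

    # Tokens with persistently high confidence act as imitation anchors;
    # exclude them so optimization is not dominated by already-learned tokens.
    keep = (conf_history < anchor_threshold).float()

    # Average only over the selected yet-to-learn tokens
    # (clamp avoids division by zero when every token is anchored).
    return (per_token_ce * keep).sum() / keep.sum().clamp(min=1.0)
```

In this sketch the selection mask is recomputed each step from the confidence trajectory, so a token that was suppressed early can re-enter the objective once its anchors stop dominating.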
Problem

Research questions and friction points this paper is trying to address.

continual distillation · training bottleneck · token-level optimization · reasoning capability · model distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

token selection · knowledge distillation · training trajectory · reasoning efficiency · imitation-anchor tokens