AI Summary
This work addresses a key limitation in knowledge distillation for large language model reasoning: existing methods often rely on teacher model strength or student likelihood over trajectories, which poorly reflect the true learning value of those trajectories. To overcome this, the authors propose the Rank-Surprisal Ratio (RSR), a novel metric that jointly considers the student model's token ranking and negative log-likelihood, thereby balancing behavioral alignment with informational content to identify high-quality distillation data. Experiments across five student models and eleven sets of teacher-generated trajectories demonstrate that RSR exhibits a strong correlation with post-distillation performance (average Spearman correlation of 0.86), significantly outperforming existing metrics and effectively guiding both trajectory selection and teacher model choice.
Abstract
Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data-student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that align closely with the student model's current behavior but overlooking more informative ones. Addressing this, we propose Rank-Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically balance learning signal strength and behavioral alignment by combining low absolute probability with relatively high-ranked tokens under the student model. Concretely, RSR is defined as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood, and is straightforward to compute and interpret. Across five student models and reasoning trajectories from 11 diverse teachers, RSR strongly correlates with post-training reasoning performance (average Spearman 0.86), consistently outperforming existing metrics. We further demonstrate its practical utility in both trajectory selection and teacher selection.
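The definition above (average token-wise rank divided by average negative log-likelihood) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `rank_surprisal_ratio`, the list-of-lists input layout, the 1-based rank convention, and the tie handling are all assumptions made for illustration, and the paper may apply further details (such as rank clipping or normalization) that the abstract does not state.

```python
import math


def rank_surprisal_ratio(logits, targets):
    """Hypothetical helper: ratio of mean token rank to mean NLL.

    logits  -- one list of student logits (over the vocabulary) per position
    targets -- the trajectory's token id at each position
    """
    if not targets or len(logits) != len(targets):
        raise ValueError("logits and targets must be non-empty and aligned")

    total_rank = 0.0
    total_nll = 0.0
    for row, tok in zip(logits, targets):
        # Numerically stable log-sum-exp gives the log partition function.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        # Negative log-likelihood (surprisal) of the trajectory token.
        total_nll += log_z - row[tok]
        # Assumed convention: rank 1 is the most probable token;
        # tokens tied with the target do not push its rank down.
        total_rank += 1 + sum(1 for x in row if x > row[tok])

    n = len(targets)
    return (total_rank / n) / (total_nll / n)


# Toy usage: two positions over a three-token vocabulary.
toy_logits = [[2.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
toy_targets = [1, 2]
print(rank_surprisal_ratio(toy_logits, toy_targets))
```

Under this reading, a trajectory whose tokens are high-ranked (small rank values) yet carry low absolute probability (large surprisal) produces a small ratio, which matches the combination the abstract describes as typical of effective trajectories.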