In Good GRACEs: Principled Teacher Selection for Knowledge Distillation

📅 2025-11-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the costly trial-and-error process of teacher selection in knowledge distillation, this paper proposes GRACE, a lightweight score that estimates how effective a teacher will be for post-training a given student. GRACE is computed from distributional properties of the student's gradients and requires no access to a verifier, teacher logits, teacher internals, or test data. From an information-theoretic perspective, the score connects to the leave-one-out stability of gradient-based algorithms, which controls the generalization performance of the distilled student. On GSM8K and MATH, GRACE reaches up to 86% Spearman correlation with the performance of distilled LLaMA and OLMo students. Training a student with the GRACE-selected teacher improves performance by up to 7.4% over naively using the best-performing teacher. GRACE also guides distillation design choices: the generation temperature for the teacher, the best teacher under a size constraint, and the best teacher within a model family.

📝 Abstract

Knowledge distillation is an efficient strategy to use data generated by large "teacher" language models to train smaller capable "student" models, but selecting the optimal teacher for a specific student-task combination requires expensive trial-and-error. We propose a lightweight score called GRACE to quantify how effective a teacher will be for post-training a student model. GRACE measures distributional properties of the student's gradients without access to a verifier, teacher logits, teacher internals, or test data. From an information-theoretic perspective, GRACE connects to leave-one-out stability of gradient-based algorithms, which controls the generalization performance of the distilled students. On GSM8K and MATH, GRACE correlates strongly (up to 86% Spearman correlation) with the performance of the distilled LLaMA and OLMo students. In particular, training a student using the GRACE-selected teacher can improve the performance by up to 7.4% over naively using the best-performing teacher. Further, GRACE can provide guidance on crucial design choices in distillation, including (1) the best temperature to use when generating from the teacher, (2) the best teacher to use given a size constraint, and (3) the best teacher to use within a specific model family. Altogether, our findings demonstrate that GRACE can efficiently and effectively identify a strongly compatible teacher for a given student and provide fine-grained guidance on how to perform distillation.
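
The Spearman correlation quoted in the abstract measures how well the ranking of teachers by score agrees with the ranking of the resulting students by accuracy. The sketch below shows that calculation only; it does not implement GRACE itself, whose formula is not given here. The teacher names, score values, and accuracies are hypothetical placeholders, not results from the paper, and the sign convention of the score (whether higher or lower is better) is not assumed.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-teacher scores and distilled-student accuracies.
teacher_scores = {"teacher_a": 0.42, "teacher_b": 0.77,
                  "teacher_c": 0.31, "teacher_d": 0.65}
student_accuracy = {"teacher_a": 0.51, "teacher_b": 0.58,
                    "teacher_c": 0.47, "teacher_d": 0.60}

names = sorted(teacher_scores)
rho = spearman([teacher_scores[n] for n in names],
               [student_accuracy[n] for n in names])
print(f"Spearman correlation: {rho:.2f}")
```

A correlation near 1 or -1 means the score orders the candidate teachers almost exactly as the final student accuracies do, which is what makes such a score usable for choosing a teacher before running any distillation.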

Problem

Research questions and friction points this paper is trying to address.

Selecting optimal teacher models for knowledge distillation without trial-and-error

Quantifying teacher effectiveness for student training through gradient distribution analysis

Providing guidance on distillation design choices like temperature and model selection

Innovation

Methods, ideas, or system contributions that make the work stand out.

GRACE score quantifies teacher effectiveness for distillation

GRACE measures distributional properties of student gradients without a verifier, teacher logits, teacher internals, or test data

GRACE connects theoretically to the leave-one-out stability of gradient-based algorithms

🔎 Similar Papers

Classroom-Inspired Multi-Mentor Distillation with Adaptive Learning Strategies