Cross-Tokenizer Distillation via Approximate Likelihood Matching

📅 2025-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing knowledge distillation methods require teacher and student models to share a tokenizer, severely limiting the set of usable teacher-student pairs. This work proposes the first tokenizer-agnostic, purely distillation-based framework, which directly aligns the output distributions of teacher and student models under a target tokenizer via approximate likelihood matching, eliminating reliance on a next-token prediction loss or tokenizer sharing. The method minimizes the divergence between token-level output distributions, enabling effective transfer across heterogeneous tokenization schemes (e.g., subword-based Llama/Gemma tokenizers and byte-level tokenizers) and facilitating probabilistic ensembling of heterogeneous models under a unified tokenizer. Experiments show that the approach significantly outperforms baselines in distilling Llama and Gemma models into byte-level tokenization settings; moreover, it successfully transfers mathematical reasoning capabilities from specialized large models to compact students, achieving competitive performance on mathematical problem-solving benchmarks.
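To make the distillation objective concrete, here is a minimal sketch of a generic token-level distribution-matching loss (teacher-to-student KL), not the paper's approximate likelihood matching objective itself. It assumes both sets of logits are already expressed under the same target tokenizer, which is exactly the alignment problem the paper's method solves; array shapes and names are illustrative.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_level_kl(teacher_logits, student_logits, eps=1e-12):
    # KL(teacher || student), averaged over sequence positions.
    # Both arrays have shape (seq_len, vocab_size) under the SAME
    # target tokenizer -- the hard part in the cross-tokenizer setting.
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return kl.mean()

rng = np.random.default_rng(0)
t = rng.normal(size=(4, 16))
assert token_level_kl(t, t) < 1e-6  # identical predictions -> ~zero loss
assert token_level_kl(t, 2.0 * t) > 0.0  # sharper student -> positive loss
```

The distillation is "pure" in the sense that the student's training signal comes entirely from matching the teacher's distributions, with no ground-truth next-token cross-entropy term.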

📝 Abstract
Distillation has shown remarkable success in transferring knowledge from a Large Language Model (LLM) teacher to a student LLM. However, current distillation methods predominantly require the same tokenizer between the teacher and the student, restricting their applicability to only a small subset of teacher-student pairs. In this work, we develop a cross-tokenizer distillation method to solve this crucial deficiency. Our method is the first to enable cross-tokenizer distillation without a next-token prediction loss as the main objective, instead purely maximizing the student predictions' similarity to the teacher's predictions (known as pure distillation), while also being robust to large mismatches between the teacher and the student tokenizer function and vocabulary. Empirically, our method enables substantially improved performance as tested on two use cases. First, we show that viewing tokenizer transfer as self-distillation enables unprecedentedly effective transfer across tokenizers. We transfer (subword-level) Llama and Gemma models to byte-level tokenization more effectively than prior methods transfer to a similar subword tokenizer under a comparable training budget. Transferring different base models to the same tokenizer also enables ensembling them (e.g., via averaging their predicted probabilities) which boosts performance. Second, we use our cross-tokenizer distillation method to distil a large maths-specialized LLM into a smaller model, achieving competitive maths problem-solving performance. Overall, our results make substantial strides toward better adaptability and enhanced interaction between different LLMs.
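The ensembling use case in the abstract (averaging predicted probabilities once different base models share a tokenizer) can be sketched in a few lines. This is a hedged illustration: the model names and vocabulary size are hypothetical, and in practice the logits would come from transferred Llama/Gemma checkpoints, not random arrays.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_next_token_probs(logit_list):
    # Average the next-token probability distributions of several models
    # that (after tokenizer transfer) share a single vocabulary.
    probs = [softmax(l) for l in logit_list]
    return np.mean(probs, axis=0)

rng = np.random.default_rng(0)
llama_logits = rng.normal(size=32)  # hypothetical vocabulary of 32 tokens
gemma_logits = rng.normal(size=32)
p = ensemble_next_token_probs([llama_logits, gemma_logits])
assert np.isclose(p.sum(), 1.0)  # the average is still a valid distribution
```

Averaging in probability space (rather than logit space) keeps the ensemble a proper mixture of the member models' distributions, which is only well-defined because the models emit distributions over the same vocabulary.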
Problem

Research questions and friction points this paper is trying to address.

Current distillation methods require teacher and student to share a tokenizer, restricting applicable model pairs
Prior cross-tokenizer transfer relies on a next-token prediction loss and handles tokenizer mismatches poorly
Large specialized LLMs are hard to compress into smaller models across tokenizer boundaries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-tokenizer distillation without next-token prediction loss
Pure distillation maximizing student-teacher prediction similarity
Robust to large tokenizer and vocabulary mismatches
Benjamin Minixhofer
PhD Student, University of Cambridge & Ai2
Natural Language Processing · Representation Learning
E. Ponti
University of Edinburgh, University of Cambridge
Ivan Vulić
University of Cambridge