Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling

📅 2025-03-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the token misalignment and output-distribution mismatch arising from vocabulary heterogeneity between teacher and student language models, this paper proposes a vocabulary-agnostic teacher-guided knowledge distillation framework. Methodologically, it introduces (1) a token-level lexical alignment mechanism enabling fine-grained cross-vocabulary semantic matching, and (2) a teacher-guided loss that directly optimizes the student's output distribution to align with the teacher's, eliminating any reliance on a shared vocabulary. Evaluated with TinyLlama (1B student) and Qwen2.5-Math-Instruct (7B teacher), which share only about 6% vocabulary overlap, the method improves downstream task performance by 46% over naive continual pretraining. To the authors' knowledge, this is the first framework enabling effective knowledge transfer between arbitrarily heterogeneous teacher-student vocabulary pairs, establishing a new paradigm for efficient and flexible large-model distillation.
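The token-level lexical alignment idea can be illustrated with a minimal sketch: map each student token to the teacher tokens it overlaps by comparing character offsets over the same underlying text. The function names and toy tokenizations below are illustrative assumptions, not the paper's implementation; real systems would obtain offsets from the models' tokenizers (e.g. Hugging Face offset mappings).

```python
# Hypothetical sketch of token-level lexical alignment across mismatched
# vocabularies. Tokenizations are toy lists of strings over the same text.

def char_spans(tokens):
    """Return (start, end) character offsets for each token, in order."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans

def align_tokens(student_tokens, teacher_tokens):
    """For each student token, list the indices of teacher tokens whose
    character spans overlap it."""
    s_spans = char_spans(student_tokens)
    t_spans = char_spans(teacher_tokens)
    alignment = []
    for s_start, s_end in s_spans:
        overlaps = [j for j, (t_start, t_end) in enumerate(t_spans)
                    if t_start < s_end and s_start < t_end]
        alignment.append(overlaps)
    return alignment

# The same word, segmented two different ways:
student = ["un", "believ", "able"]   # student vocabulary pieces
teacher = ["unbeliev", "able"]       # teacher vocabulary pieces
print(align_tokens(student, teacher))  # [[0], [0], [1]]
```

Once such an alignment exists, any per-token signal from the teacher (losses, distributions) can be carried over to student tokens despite the vocabulary mismatch.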

📝 Abstract
Using large teacher models to guide the training of smaller student models has become the prevailing paradigm for efficient and effective learning. However, vocabulary mismatches between teacher and student language models pose significant challenges in language modeling, resulting in divergent token sequences and output distributions. To overcome these limitations, we propose Vocabulary-agnostic Teacher Guided Language Modeling (VocAgnoLM), a novel approach that bridges the gap caused by vocabulary mismatch through two key methods: (1) Token-level Lexical Alignment, which aligns token sequences across mismatched vocabularies, and (2) Teacher Guided Loss, which leverages the teacher model's loss to guide effective student training. We demonstrate its effectiveness in language modeling with a 1B student model and various 7B teacher models with different vocabularies. Notably, with Qwen2.5-Math-Instruct, a teacher model sharing only about 6% of its vocabulary with TinyLlama, VocAgnoLM achieves a 46% performance improvement compared to naive continual pretraining. Furthermore, we demonstrate that VocAgnoLM consistently benefits from stronger teacher models, providing a robust solution to vocabulary mismatches in language modeling.
Problem

Research questions and friction points this paper is trying to address.

Addressing vocabulary mismatch in teacher-student language models
Aligning token sequences across different vocabularies
Improving student model performance with vocabulary-agnostic training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vocabulary-agnostic teacher guided modeling
Token-level lexical alignment method
Teacher guided loss optimization
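The "Teacher guided loss" contribution above can be sketched as reweighting the student's per-token losses by a difficulty signal from the teacher, carried across the vocabulary mismatch via a precomputed token alignment. The function name and the mean-based weighting scheme below are assumptions for illustration; the paper's actual loss formulation may differ.

```python
# Hedged sketch of a teacher-guided loss: scale each student token's loss
# by the mean teacher loss over its aligned teacher tokens, so tokens the
# teacher finds hard (or easy) steer the student's training signal.
# This weighting rule is illustrative, not the paper's exact method.

def teacher_guided_loss(student_losses, teacher_losses, alignment):
    """alignment[i] lists the teacher-token indices covering student token i.
    Unaligned student tokens keep weight 1.0. Returns the mean weighted loss."""
    total = 0.0
    for i, loss in enumerate(student_losses):
        aligned = alignment[i]
        if aligned:
            weight = sum(teacher_losses[j] for j in aligned) / len(aligned)
        else:
            weight = 1.0
        total += weight * loss
    return total / len(student_losses)

# Two student tokens, each aligned to one teacher token:
print(teacher_guided_loss([1.0, 2.0], [0.5, 2.0], [[0], [1]]))  # 2.25
```

Because the weighting operates on scalar per-token losses rather than on vocabulary-indexed logits, it needs no shared vocabulary between teacher and student.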