🤖 AI Summary
Language models rely on static vocabularies fixed at pretraining time, which degrades performance and increases computational overhead in low-resource domains. Existing approaches for initializing embeddings of out-of-vocabulary (OOV) tokens require additional pretraining or fine-tuning, which is costly and scales poorly. This paper proposes AweDist, a zero-shot, lightweight method for initializing OOV token embeddings. It is the first to introduce attention-aware representation distillation for OOV embedding generation, combining cross-granularity attention-guided embedding distillation, intermediate-layer representation transfer under the original tokenizer's segmentation, and scalable soft-target alignment, all without any retraining. Evaluated on a range of open-weight large language models, the method significantly outperforms strong baselines: OOV embeddings converge faster, yield substantial downstream-task improvements, and add near-zero computational overhead.
📝 Abstract
Current language models rely on static vocabularies determined at pretraining time, which can lead to decreased performance and increased computational cost for domains underrepresented in the original vocabulary. New tokens can be added to solve this problem when coupled with a good initialization for their embeddings. However, existing embedding initialization methods either require expensive further training or the pretraining of additional modules. In this paper, we propose AweDist and show that by distilling representations obtained using the original tokenization, we can quickly learn high-quality input embeddings for new tokens. Experimental results with a wide range of open-weight models show that AweDist outperforms even strong baselines.
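The core idea in the abstract, learning a new token's input embedding by distilling the representation the model produces under the original multi-piece tokenization, can be sketched in a toy form. Everything below is an assumption for illustration: the one-layer `tanh` "encoder" stands in for the model's lower layers, the teacher target is simply the mean of the subword pieces' encoded states, and plain MSE gradient descent stands in for the paper's actual distillation objective.

```python
import numpy as np

# Toy sketch of representation distillation for a new token's input embedding.
# Illustrative only: the encoder, target, and loss are assumptions, not the
# paper's actual architecture or objective.

rng = np.random.default_rng(0)
d = 16  # embedding / hidden size

# Frozen toy "encoder": one tanh layer standing in for the intermediate
# layers whose representations are distilled.
W = rng.normal(size=(d, d)) / np.sqrt(d)
encode = lambda e: np.tanh(W @ e)

# The original tokenization splits the new word into two subword pieces.
e_sub1, e_sub2 = rng.normal(size=d), rng.normal(size=d)
# Teacher target: representation obtained under the original segmentation
# (here, just the mean of the pieces' encoded states).
target = (encode(e_sub1) + encode(e_sub2)) / 2.0

# Student: fit a single embedding for the new token so its encoded state
# matches the teacher representation; the model weights stay frozen.
e_new = rng.normal(size=d)
lr = 0.1
losses = []
for _ in range(500):
    h = np.tanh(W @ e_new)
    diff = h - target
    losses.append(float(diff @ diff))
    # Manual backprop through tanh and the linear map.
    grad = W.T @ (2.0 * diff * (1.0 - h ** 2))
    e_new -= lr * grad

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.6f}")
```

Because only one embedding vector is optimized against a fixed target, the procedure is cheap and needs no retraining of the model itself, which mirrors the near-zero-overhead claim in the summary.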