🤖 AI Summary
This work addresses a critical yet overlooked bottleneck in vocabulary expansion for language models: standard mean initialization of new tokens—such as semantic IDs in recommender systems—often collapses their embeddings into a degenerate subspace, impeding effective discrimination even after fine-tuning. To resolve this, the authors propose Grounded Token Initialization (GTI), a lightweight grounding stage that uses paired linguistic supervision to map new tokens into semantically meaningful and well-separated regions of the pretrained embedding space prior to fine-tuning. Through spectral and geometric analyses, the study systematically demonstrates that proper initialization is pivotal for successful vocabulary extension. Evaluated across multiple generative recommendation benchmarks—including industrial-scale and public datasets—GTI substantially outperforms mean initialization and existing adaptation strategies, while preserving rich embedding structure post-fine-tuning.
📝 Abstract
Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that *token initialization* is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the *Grounded Token Initialization Hypothesis*: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
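The rank-collapse diagnosis above can be made concrete with a minimal numpy sketch. This is an illustrative toy, not the paper's implementation: the embedding table is random, and the "paired linguistic supervision" is stood in for by hypothetical per-token sets of paired text-token ids (e.g. an item's title tokens). Mean initialization gives every new token the same vector, so the new-token block is rank 1; averaging each token's own paired embeddings instead yields distinct, well-separated rows.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n_new = 1000, 64, 8  # toy vocab size, embedding dim, number of new tokens

# Stand-in for a pretrained LM's vocabulary embedding table.
E = rng.normal(size=(V, d))

# Mean initialization: every new token starts at the same point,
# so the new-token embedding block is a rank-1 (degenerate) subspace.
mean_init = np.tile(E.mean(axis=0), (n_new, 1))
rank_mean = np.linalg.matrix_rank(mean_init)

# Grounded-style initialization (hypothetical sketch of the GTI idea):
# place each new token at the mean of its *paired* text tokens,
# giving each token a distinct, semantically anchored location.
paired_ids = [rng.choice(V, size=5, replace=False) for _ in range(n_new)]
grounded_init = np.stack([E[ids].mean(axis=0) for ids in paired_ids])
rank_grounded = np.linalg.matrix_rank(grounded_init)

print(rank_mean, rank_grounded)  # mean init: rank 1; grounded init: full rank
```

Fine-tuning must recover all inter-token distinctions from the rank-1 start, whereas the grounded block already spans `n_new` directions before any gradient step.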