🤖 AI Summary
This work addresses a critical yet overlooked bottleneck in vocabulary expansion for language models: standard mean initialization of new tokens—such as semantic IDs in recommender systems—often collapses their embeddings into a degenerate subspace, impeding effective discrimination even after fine-tuning. To resolve this, the authors propose Grounded Token Initialization (GTI), a lightweight grounding stage that uses paired linguistic supervision to map new tokens into semantically meaningful and well-separated regions of the pretrained embedding space prior to fine-tuning. Through spectral and geometric analyses, the study systematically demonstrates that proper initialization is pivotal for successful vocabulary extension. Evaluated across multiple generative recommendation benchmarks—including industrial-scale and public datasets—GTI substantially outperforms mean initialization and existing adaptation strategies, while preserving rich embedding structure post-fine-tuning.
📝 Abstract
Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that *token initialization* is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the *Grounded Token Initialization Hypothesis*: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
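The rank-collapse diagnosis above can be made concrete with a minimal numpy sketch. This is an illustrative toy, not the paper's implementation: the embedding table is random, and the "paired linguistic supervision" is stood in for by hypothetical per-token sets of paired text-token ids (e.g. an item's title tokens). Mean initialization gives every new token the same vector, so the new-token block is rank 1; averaging each token's own paired embeddings instead yields distinct, well-separated rows.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n_new = 1000, 64, 8  # toy vocab size, embedding dim, number of new tokens

# Stand-in for a pretrained LM's vocabulary embedding table.
E = rng.normal(size=(V, d))

# Mean initialization: every new token starts at the same point,
# so the new-token embedding block is a rank-1 (degenerate) subspace.
mean_init = np.tile(E.mean(axis=0), (n_new, 1))
rank_mean = np.linalg.matrix_rank(mean_init)

# Grounded-style initialization (hypothetical sketch of the GTI idea):
# place each new token at the mean of its *paired* text tokens,
# giving each token a distinct, semantically anchored location.
paired_ids = [rng.choice(V, size=5, replace=False) for _ in range(n_new)]
grounded_init = np.stack([E[ids].mean(axis=0) for ids in paired_ids])
rank_grounded = np.linalg.matrix_rank(grounded_init)

print(rank_mean, rank_grounded)  # mean init: rank 1; grounded init: full rank
```

Fine-tuning must recover all inter-token distinctions from the rank-1 start, whereas the grounded block already spans `n_new` directions before any gradient step.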