🤖 AI Summary
English-dominant large language models exhibit high token fertility and suboptimal inference efficiency on Italian. To address this, the authors propose Semantic Alignment Vocabulary Adaptation (SAVA), a method that leverages neural mapping between embedding spaces to substitute an English-centric vocabulary with one optimized for the target language. Experiments show that SAVA reduces token fertility by 25% for Mistral-7b-v0.1 on Italian, and for Llama-3.1-8B it optimizes the vocabulary, cutting roughly one billion parameters. After adaptation, a relatively limited stage of continual training on Italian restores downstream performance, and the adapted models remain competitive on multiple-choice and generative tasks.
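Token fertility, the efficiency metric the paper targets, is typically measured as the average number of subword tokens a tokenizer produces per whitespace-delimited word. A minimal sketch (the segmentation below is hypothetical, not output of any real tokenizer):

```python
def token_fertility(tokenized_corpus, corpus):
    """Average number of subword tokens per whitespace word."""
    n_tokens = sum(len(tokens) for tokens in tokenized_corpus)
    n_words = sum(len(sentence.split()) for sentence in corpus)
    return n_tokens / n_words

# Toy illustration: an English-centric tokenizer over-segments Italian words.
corpus = ["la velocità è importante"]
# Hypothetical subword segmentation of the sentence above.
tokenized = [["la", "vel", "oc", "ità", "è", "import", "ante"]]
print(token_fertility(tokenized, corpus))  # 7 tokens / 4 words = 1.75
```

A lower fertility means fewer tokens per word, i.e. cheaper and faster inference on the target language.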
📝 Abstract
The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, owing to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, which leads to inefficient encoding (high token "fertility") and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks thanks to its grounded alignment strategy. We adapt two LLMs: Mistral-7b-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multiple-choice and generative tasks.
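One way a neural mapping for vocabulary substitution could work is to learn a map between the embedding space of a helper model (whose tokenizer already covers the target vocabulary) and that of the source LLM, fit on tokens shared by both vocabularies, and then use it to initialize embeddings for the new tokens. The sketch below uses a least-squares linear map on random data purely for illustration; the paper's actual mapping procedure and architecture may differ:

```python
import numpy as np

# Hypothetical setup: E_src_* are source-model embeddings, E_helper_* are
# embeddings from a helper model covering the target (Italian) vocabulary.
rng = np.random.default_rng(0)
d_src, d_helper = 8, 6
n_shared = 50   # tokens present in both vocabularies
n_new = 10      # target-vocabulary tokens missing from the source model

E_helper_shared = rng.normal(size=(n_shared, d_helper))
E_src_shared = rng.normal(size=(n_shared, d_src))
E_helper_new = rng.normal(size=(n_new, d_helper))

# Fit a linear map W (helper space -> source space) by least squares on the
# shared tokens: minimize ||E_helper_shared @ W - E_src_shared||_F.
W, *_ = np.linalg.lstsq(E_helper_shared, E_src_shared, rcond=None)

# Initialize source-space embeddings for the new target-language tokens.
E_src_new = E_helper_new @ W
print(E_src_new.shape)  # (10, 8)
```

Shared tokens act as anchor points that ground the alignment, so new tokens land near semantically related regions of the source model's embedding space before continual training refines them.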