🤖 AI Summary
English-dominant large language models exhibit high token fertility and suboptimal inference efficiency on Italian. To address this, the authors propose Semantic Alignment Vocabulary Adaptation (SAVA), a method that leverages neural mapping between embedding spaces to substitute an English-centric vocabulary with one optimized for the target language. Experiments show that SAVA reduces token fertility by 25% for Mistral-7b-v0.1 on Italian, and for Llama-3.1-8B it optimizes the vocabulary, cutting roughly one billion parameters. After adaptation, a relatively limited stage of continual training on Italian restores downstream performance, and the adapted models remain competitive on multiple-choice and generative tasks.
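Token fertility, the efficiency metric the paper targets, is typically measured as the average number of subword tokens a tokenizer produces per whitespace-delimited word. A minimal sketch (the segmentation below is hypothetical, not output of any real tokenizer):

```python
def token_fertility(tokenized_corpus, corpus):
    """Average number of subword tokens per whitespace word."""
    n_tokens = sum(len(tokens) for tokens in tokenized_corpus)
    n_words = sum(len(sentence.split()) for sentence in corpus)
    return n_tokens / n_words

# Toy illustration: an English-centric tokenizer over-segments Italian words.
corpus = ["la velocità è importante"]
# Hypothetical subword segmentation of the sentence above.
tokenized = [["la", "vel", "oc", "ità", "è", "import", "ante"]]
print(token_fertility(tokenized, corpus))  # 7 tokens / 4 words = 1.75
```

A lower fertility means fewer tokens per word, i.e. cheaper and faster inference on the target language.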
📝 Abstract
The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, owing to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, which leads to inefficient encoding (high token "fertility") and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks thanks to its grounded alignment strategy. We adapt two LLMs: Mistral-7b-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multiple-choice and generative tasks.
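One way a neural mapping for vocabulary substitution could work is to learn a map between the embedding space of a helper model (whose tokenizer already covers the target vocabulary) and that of the source LLM, fit on tokens shared by both vocabularies, and then use it to initialize embeddings for the new tokens. The sketch below uses a least-squares linear map on random data purely for illustration; the paper's actual mapping procedure and architecture may differ:

```python
import numpy as np

# Hypothetical setup: E_src_* are source-model embeddings, E_helper_* are
# embeddings from a helper model covering the target (Italian) vocabulary.
rng = np.random.default_rng(0)
d_src, d_helper = 8, 6
n_shared = 50   # tokens present in both vocabularies
n_new = 10      # target-vocabulary tokens missing from the source model

E_helper_shared = rng.normal(size=(n_shared, d_helper))
E_src_shared = rng.normal(size=(n_shared, d_src))
E_helper_new = rng.normal(size=(n_new, d_helper))

# Fit a linear map W (helper space -> source space) by least squares on the
# shared tokens: minimize ||E_helper_shared @ W - E_src_shared||_F.
W, *_ = np.linalg.lstsq(E_helper_shared, E_src_shared, rcond=None)

# Initialize source-space embeddings for the new target-language tokens.
E_src_new = E_helper_new @ W
print(E_src_new.shape)  # (10, 8)
```

Shared tokens act as anchor points that ground the alignment, so new tokens land near semantically related regions of the source model's embedding space before continual training refines them.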