Adapters for Altering LLM Vocabularies: What Languages Benefit the Most?

📅 2024-10-12
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the challenges of vocabulary expansion in pretrained language models: the difficulty of extending the vocabulary itself and severe subword fragmentation of novel tokens. It proposes VocADT, a vocabulary adaptation method that trains adapter modules to construct embeddings for new tokens as optimal linear combinations of the existing embeddings, without modifying the original model weights or relying on external embeddings. To the authors' knowledge, this is the first application of adapter mechanisms to vocabulary extension, enabling script- and resource-agnostic adaptation. Evaluation across 11 languages shows consistent improvements over the base Mistral model and leading baselines on both natural language understanding (NLU) and machine translation (MT) tasks. VocADT's gains remain stable after fine-tuning on generative tasks, with the strongest benefits for highly fragmented languages such as Thai and Arabic.

📝 Abstract
Vocabulary adaptation, which integrates new vocabulary into pre-trained language models, enables expansion to new languages and mitigates token over-fragmentation. However, existing approaches are limited by their reliance on heuristics or external embeddings. We propose VocADT, a novel method for vocabulary adaptation using adapter modules that are trained to learn the optimal linear combination of existing embeddings while keeping the model's weights fixed. VocADT offers a flexible and scalable solution without depending on external resources or language constraints. Across 11 languages, with diverse scripts, resource availability, and fragmentation, we demonstrate that VocADT outperforms the original Mistral model and other baselines across various multilingual tasks including natural language understanding and machine translation. We find that Latin-script languages and highly fragmented languages benefit the most from vocabulary adaptation. We further fine-tune the adapted model on the generative task of machine translation and find that vocabulary adaptation is still beneficial after fine-tuning and that VocADT is the most effective.
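The core mechanism described in the abstract, building each new token's embedding as a trained linear combination of the frozen original embeddings, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the class name, the softmax normalization of the mixing weights, and all shapes are assumptions for the example.

```python
import torch
import torch.nn as nn

class VocabAdapter(nn.Module):
    """Sketch of a VocADT-style adapter: each new-vocabulary embedding is
    a learned linear combination of the frozen original embeddings."""

    def __init__(self, old_embeddings: torch.Tensor, new_vocab_size: int):
        super().__init__()
        # Freeze the original embedding table (old_vocab x dim):
        # buffers are saved with the model but never receive gradients.
        self.register_buffer("old_embeddings", old_embeddings)
        # Trainable mixing weights: one row of combination coefficients
        # per new-vocabulary token (new_vocab x old_vocab). Only these
        # parameters are updated during adaptation.
        self.mix = nn.Parameter(torch.empty(new_vocab_size, old_embeddings.size(0)))
        nn.init.normal_(self.mix, std=0.02)

    def forward(self) -> torch.Tensor:
        # New embedding table = normalized combination of the old rows
        # (softmax normalization is an assumption of this sketch).
        weights = torch.softmax(self.mix, dim=-1)
        return weights @ self.old_embeddings  # (new_vocab x dim)

# Toy usage: 100-token original vocab, 32-dim embeddings, 20 new tokens.
old = torch.randn(100, 32)
adapter = VocabAdapter(old, new_vocab_size=20)
new_table = adapter()
```

Because only the mixing matrix is trainable, the base model's weights stay fixed, which matches the abstract's claim of adaptation without modifying the original model or consulting external embeddings.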
Problem

Research questions and friction points this paper is trying to address.

How to integrate new vocabulary into pre-trained language models to support new languages.
How to mitigate token over-fragmentation in languages poorly covered by the original vocabulary.
Existing vocabulary adaptation approaches rely on heuristics or external embeddings.
Innovation

Methods, ideas, or system contributions that make the work stand out.

VocADT trains adapter modules for vocabulary adaptation while keeping the base model's weights fixed.
The adapters learn an optimal linear combination of existing embeddings for each new token.
VocADT outperforms the original Mistral model and other baselines on multilingual NLU and MT tasks.