Self-Vocabularizing Training for Neural Machine Translation

📅 2025-03-18

📈 Citations: 0

✨ Influential: 0

career value

185K/year

🤖 AI Summary

Existing NMT vocabulary learning methods rely on pretraining statistics and entropy assumptions, neglecting the model’s actual token selection preferences during training—leading to misalignment between the BPE vocabulary and its empirically used subset, thereby degrading translation performance. To address this, we propose an iterative self-vocabularization training framework: it generates pseudo-labels via self-training and dynamically analyzes token-wise entropy shifts and contribution scores to automatically prune and reinitialize a smaller, more optimal subvocabulary. We are the first to identify and formally model vocabulary-induced shift—a phenomenon wherein token distributions evolve systematically during training. Our approach introduces a fully end-to-end, human-free self-vocabularization mechanism. Empirically, we find that deeper architectures naturally compress vocabulary size by 6–8% while enhancing token diversity. On standard benchmarks, our method achieves up to +1.49 BLEU, accelerates convergence, and improves generalization.

Technology Category

Application Category

📝 Abstract

Past vocabulary learning techniques identify relevant vocabulary before training, relying on statistical and entropy-based assumptions that largely neglect the role of model training. Empirically, we observe that trained translation models are induced to use a byte-pair encoding (BPE) vocabulary subset distinct from the original BPE vocabulary, leading to performance improvements when retrained with the induced vocabulary. In this paper, we analyze this discrepancy in neural machine translation by examining vocabulary and entropy shifts during self-training--where each iteration generates a labeled dataset by pairing source sentences with the model's predictions to define a new vocabulary. Building on these insights, we propose self-vocabularizing training, an iterative method that self-selects a smaller, more optimal vocabulary, yielding up to a 1.49 BLEU improvement. Moreover, we find that deeper model architectures lead to both an increase in unique token usage and a 6-8% reduction in vocabulary size.

Problem

Research questions and friction points this paper is trying to address.

Analyzes vocabulary and entropy shifts in neural machine translation.

Proposes self-vocabularizing training for optimal vocabulary selection.

Improves translation performance with smaller, more efficient vocabularies.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-vocabularizing training iteratively optimizes vocabulary.

Byte-pair encoding subset improves translation performance.

Deeper models reduce vocabulary size by 6-8%.

🔎 Similar Papers

Large Vocabulary Size Improves Large Language Models