MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation

📅 2024-11-03
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Byte-level neural machine translation (NMT) suffers from semantic sparsity and cross-lingual encoding heterogeneity, hindering effective contextual modeling. Method: We propose a Mixture of Contextualization Experts (MoCE) mechanism for the Transformer architecture, introducing gated mixture attention in which each attention head is dynamically treated as a contextualization expert; experts are adaptively selected and weighted based on Unicode byte features and local context, requiring no manual hyperparameter tuning and naturally accommodating scale variations across multilingual byte sequences. Contribution/Results: On the Ted-59 benchmark, MoCE significantly outperforms existing byte-level models, uses fewer parameters than mainstream subword-based models, and achieves superior translation performance on low-resource languages. The approach enhances semantic representation capability and cross-lingual generalization in byte-level NMT.

๐Ÿ“ Abstract
Byte-based machine translation systems have shown significant potential in massively multilingual settings. Unicode encoding, which maps each character to specific byte(s), eliminates the emergence of unknown words, even in new languages. This avoids out-of-vocabulary risk in multilingual translation and enables broad language scalability. However, byte-level tokenization results in sequences that are hard to interpret due to limited semantic information per byte. Local contextualization has proven effective in assigning initial semantics to tokens, improving sentence comprehension. Nevertheless, variations in encoding rules across languages necessitate an adaptive approach for effective contextualization. To this end, we propose Mixture of Contextualization Experts (MoCE), adaptively selecting and mixing attention heads, which are treated as contextualization experts. This enhances the flexibility of contextualization scales and allows models to search for better contextualization combinations. Experimental results show that our method outperforms existing methods without extensive manual adjustment of hyper-parameters and surpasses subword-based models with fewer parameters on the Ted-59 dataset. Our code is available at https://github.com/ictnlp/MoCE.
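The core idea in the abstract, treating each attention head as a contextualization expert and mixing head outputs with adaptive, per-token gate weights, can be illustrated with a minimal NumPy sketch. The shapes, the function name `moce_attention`, and the simple linear gating projection below are illustrative assumptions for exposition, not the paper's exact formulation (see the linked repository for the actual implementation).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moce_attention(x, Wq, Wk, Wv, Wg):
    """Gated mixture over attention heads ("experts") - a hypothetical sketch.

    x:          (seq_len, d_model) byte-token embeddings
    Wq, Wk, Wv: (n_heads, d_model, d_head) per-head projections
    Wg:         (d_model, n_heads) gating projection (assumed linear gate)
    Returns the mixed output (seq_len, d_head) and gate weights (seq_len, n_heads).
    """
    n_heads, d_model, d_head = Wq.shape
    seq_len = x.shape[0]

    # Each byte token computes its own distribution over heads, so the
    # contextualization scale adapts per token rather than being fixed.
    gates = softmax(x @ Wg, axis=-1)                     # (seq_len, n_heads)

    head_outs = np.empty((n_heads, seq_len, d_head))
    for h in range(n_heads):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        attn = softmax(q @ k.T / np.sqrt(d_head), axis=-1)
        head_outs[h] = attn @ v                          # per-expert output

    # Mix expert outputs with the gate weights instead of plain concatenation.
    mixed = np.einsum('th,htd->td', gates, head_outs)    # (seq_len, d_head)
    return mixed, gates
```

The key departure from vanilla multi-head attention is the final step: head outputs are combined by a learned, input-dependent weighting rather than concatenated with fixed roles, which is what lets the model search over contextualization combinations.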
Problem

Research questions and friction points this paper is trying to address.

Adaptive contextualization for byte-based translation
Enhancing semantic interpretation in multilingual settings
Optimizing contextualization scales without manual adjustments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Byte-based tokenization
Adaptive attention heads
Mixture of Contextualization Experts
Langlin Huang
Washington University in St. Louis
NLP
Mengyu Bu
Institute of Computing Technology, Chinese Academy of Sciences
Large Language Model · Multilinguality · Machine Translation
Yang Feng
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; Key Laboratory of AI Safety, Chinese Academy of Sciences; University of Chinese Academy of Sciences