Segmentation Beyond Defaults: Asymmetrical Byte Pair Encoding for Optimal Machine Translation Performance

📅 2025-11-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Conventional machine translation research predominantly employs symmetric Byte-Pair Encoding (BPE), where the source and target sides share identical numbers of merge operations (NMOs); however, this paradigm suffers from performance degradation in low-resource settings due to subword granularity mismatch. Method: We propose an asymmetric BPE tokenization strategy that systematically investigates the impact of decoupling NMOs between source and target sides. Contribution/Results: Extensive experiments across seven language pairs and multiple data scales—particularly in low-resource settings such as English–Hindi—demonstrate that high-source/low-target NMO configurations yield substantial improvements: average CHRF++ gains of 5.32, with statistically significant improvements in 10 out of 12 systems. Our findings indicate that breaking BPE symmetry effectively mitigates subword granularity misalignment in low-resource translation, establishing a novel paradigm for tokenizer design.

📝 Abstract
Existing Machine Translation (MT) research often suggests a single, fixed set of hyperparameters for word segmentation models: symmetric Byte Pair Encoding (BPE), which applies the same number of merge operations (NMO) to train tokenizers for both source and target languages. However, we demonstrate that this uniform approach does not guarantee optimal MT performance across different language pairs and data sizes. This work investigates BPE segmentation recipes across various data volumes and language pairs to evaluate MT system performance. We find that utilizing asymmetric BPE, where the source and target languages have different NMOs, significantly improves results over the symmetric approach, especially in low-resource settings (50K, 100K, and 500K sentence pairs). Specifically, asymmetric BPE yields statistically significant ($p<0.05$) average gains of 5.32, 4.46, and 0.7 CHRF++ on English-Hindi in low-resource setups. We validated this trend across six additional language pairs (English paired with Telugu, Shona, Norwegian, Kyrgyz, Hausa, and Inuktitut), observing statistically significant improvements in 10 out of 12 systems compared to symmetric BPE. Our findings indicate that a high NMO for the source (4K to 32K) and a low NMO for the target (0.5K to 2K) provides optimal results, particularly benefiting low-resource MT.
Problem

Research questions and friction points this paper is trying to address.

Challenges fixed symmetric BPE segmentation in machine translation systems
Investigates optimal BPE configurations across language pairs and data sizes
Proposes asymmetric BPE to improve low-resource translation performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric BPE with different numbers of merge operations for source and target languages
Higher merge counts for the source language, lower merge counts for the target
Significant improvements in low-resource machine translation performance
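The asymmetric recipe can be sketched with a minimal Sennrich-style BPE learner: train two separate merge tables, applying many merge operations on the source side and few on the target side. The corpus, merge counts, and helper names below are illustrative assumptions for a toy example, not the authors' code; a real system would train on the parallel corpus and use source NMOs of 4K–32K against target NMOs of 0.5K–2K, as the paper reports.

```python
from collections import Counter


def merge_word(word, pair):
    """Replace every adjacent occurrence of `pair` in `word` with the fused symbol."""
    out, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus_words, num_merges):
    """Learn up to `num_merges` BPE merge operations from a word list."""
    vocab = Counter()
    for w in corpus_words:
        vocab[tuple(w) + ("</w>",)] += 1  # char-level start, end-of-word marker
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:  # every word fully merged; stop early
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = Counter({merge_word(w, best): f for w, f in vocab.items()})
    return merges


def apply_bpe(word, merges):
    """Segment one word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_word(symbols, pair)
    return list(symbols)


# Asymmetric setup (toy scale): high source NMO, low target NMO.
toy_corpus = ["lower", "lowest", "newer", "newest"] * 10
src_merges = learn_bpe(toy_corpus, 50)  # coarse source segmentation
tgt_merges = learn_bpe(toy_corpus, 3)   # fine-grained target segmentation

print(apply_bpe("lower", src_merges))  # few, coarse subwords
print(apply_bpe("lower", tgt_merges))  # more, finer subwords
```

With the high merge budget the source tokenizer collapses frequent words into single units, while the low-budget target tokenizer keeps words split into smaller pieces, which is the granularity mismatch the paper exploits for low-resource targets.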
Saumitra Yadav
Language Technologies Research Center, KCIS, International Institute of Information Technology Hyderabad, India
Manish Shrivastava
International Institute of Information Technology Hyderabad
Natural Language Processing · Machine Learning · Machine Translation · Cross Lingual IR · Multilingual Question Answering