🤖 AI Summary
Conventional machine translation research predominantly employs symmetric Byte-Pair Encoding (BPE), in which the source and target sides share the same number of merge operations (NMOs); in low-resource settings this uniformity can degrade performance through subword granularity mismatch. Method: We propose an asymmetric BPE tokenization strategy and systematically investigate the effect of decoupling the NMOs of the source and target sides. Contribution/Results: Experiments across seven language pairs and multiple data scales demonstrate that high-source/low-target NMO configurations yield substantial improvements, including average CHRF++ gains of up to 5.32 on English–Hindi and statistically significant improvements in 10 out of 12 systems. Our findings indicate that breaking BPE symmetry effectively mitigates subword granularity misalignment in low-resource translation and suggest a new direction for tokenizer design.
📝 Abstract
Existing Machine Translation (MT) research often prescribes a single, fixed hyperparameter setting for subword segmentation: symmetric Byte Pair Encoding (BPE), which applies the same number of merge operations (NMOs) when training tokenizers for both the source and target languages. However, we demonstrate that this uniform approach does not guarantee optimal MT performance across different language pairs and data sizes. This work investigates BPE segmentation recipes across various data volumes and language pairs to evaluate MT system performance. We find that asymmetric BPE, where the source and target languages use different NMOs, significantly improves results over the symmetric approach, especially in low-resource settings (50K, 100K, and 500K sentence pairs). Specifically, asymmetric BPE yields statistically significant ($p<0.05$) average gains of 5.32, 4.46, and 0.7 CHRF++ on English-Hindi in these low-resource setups, respectively. We validate this trend across six additional language pairs (English paired with Telugu, Shona, Norwegian, Kyrgyz, Hausa, and Inuktitut), observing statistically significant improvements in 10 out of 12 systems compared to symmetric BPE. Our findings indicate that a high NMO for the source (4K to 32K) combined with a low NMO for the target (0.5K to 2K) provides optimal results, particularly benefiting low-resource MT.
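To make the idea of asymmetric NMOs concrete, here is a minimal sketch of the classic BPE training/application loop in plain Python. It is a toy reimplementation, not the paper's code; the corpus and merge counts are invented for the demo, and real systems would use a library such as subword-nmt or SentencePiece with far larger corpora and merge counts. The only point illustrated is that training the source tokenizer with many merges and the target tokenizer with few merges yields coarser source units and finer target units:

```python
from collections import Counter

def train_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` BPE merge operations from a list of words."""
    vocab = Counter(tuple(word) for word in corpus)  # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:  # every word is already a single symbol
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i + 1 < len(symbols):
            if (symbols[i], symbols[i + 1]) == (a, b):
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Hypothetical toy corpus; in practice the source and target tokenizers
# are trained on their respective sides of the parallel data.
corpus = ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
src_merges = train_bpe_merges(corpus, num_merges=8)  # high NMO -> coarser subwords
tgt_merges = train_bpe_merges(corpus, num_merges=2)  # low NMO  -> finer subwords
print(apply_bpe("newest", src_merges), apply_bpe("newest", tgt_merges))
# -> ['newest'] ['n', 'e', 'w', 'est']
```

With more merge operations the frequent word collapses into a single coarse token, while the low-NMO tokenizer keeps it as several fine-grained pieces; the paper's asymmetric recipe applies exactly this kind of granularity split between the source and target sides.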