🤖 AI Summary
Conventional machine translation research predominantly employs symmetric Byte-Pair Encoding (BPE), in which the source and target sides share the same number of merge operations (NMOs); in low-resource settings this uniformity can degrade performance through subword granularity mismatch. Method: We propose an asymmetric BPE tokenization strategy and systematically investigate the effect of decoupling the NMOs of the source and target sides. Contribution/Results: Experiments across seven language pairs and multiple data scales demonstrate that high-source/low-target NMO configurations yield substantial improvements, including average CHRF++ gains of up to 5.32 on English–Hindi and statistically significant improvements in 10 out of 12 systems. Our findings indicate that breaking BPE symmetry effectively mitigates subword granularity misalignment in low-resource translation and suggest a new direction for tokenizer design.
📝 Abstract
Existing Machine Translation (MT) research often prescribes a single, fixed hyperparameter setting for subword segmentation: symmetric Byte Pair Encoding (BPE), which applies the same number of merge operations (NMOs) when training tokenizers for both the source and target languages. However, we demonstrate that this uniform approach does not guarantee optimal MT performance across different language pairs and data sizes. This work investigates BPE segmentation recipes across various data volumes and language pairs to evaluate MT system performance. We find that asymmetric BPE, where the source and target languages use different NMOs, significantly improves results over the symmetric approach, especially in low-resource settings (50K, 100K, and 500K sentence pairs). Specifically, asymmetric BPE yields statistically significant ($p<0.05$) average gains of 5.32, 4.46, and 0.7 CHRF++ on English-Hindi in these low-resource setups, respectively. We validate this trend across six additional language pairs (English paired with Telugu, Shona, Norwegian, Kyrgyz, Hausa, and Inuktitut), observing statistically significant improvements in 10 out of 12 systems compared to symmetric BPE. Our findings indicate that a high NMO for the source (4K to 32K) combined with a low NMO for the target (0.5K to 2K) provides optimal results, particularly benefiting low-resource MT.
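To make the idea of asymmetric NMOs concrete, here is a minimal sketch of the classic BPE training/application loop in plain Python. It is a toy reimplementation, not the paper's code; the corpus and merge counts are invented for the demo, and real systems would use a library such as subword-nmt or SentencePiece with far larger corpora and merge counts. The only point illustrated is that training the source tokenizer with many merges and the target tokenizer with few merges yields coarser source units and finer target units:

```python
from collections import Counter

def train_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` BPE merge operations from a list of words."""
    vocab = Counter(tuple(word) for word in corpus)  # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:  # every word is already a single symbol
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i + 1 < len(symbols):
            if (symbols[i], symbols[i + 1]) == (a, b):
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Hypothetical toy corpus; in practice the source and target tokenizers
# are trained on their respective sides of the parallel data.
corpus = ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
src_merges = train_bpe_merges(corpus, num_merges=8)  # high NMO -> coarser subwords
tgt_merges = train_bpe_merges(corpus, num_merges=2)  # low NMO  -> finer subwords
print(apply_bpe("newest", src_merges), apply_bpe("newest", tgt_merges))
# -> ['newest'] ['n', 'e', 'w', 'est']
```

With more merge operations the frequent word collapses into a single coarse token, while the low-NMO tokenizer keeps it as several fine-grained pieces; the paper's asymmetric recipe applies exactly this kind of granularity split between the source and target sides.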