Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

📅 2025-08-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Problem: Conventional Byte-Pair Encoding (BPE), which relies solely on subword frequency statistics, is strongly biased toward high-resource languages, producing overlong and morphologically ill-formed tokens for low-resource languages, excessive <UNK> usage, and exacerbated cross-lingual computational and cost inequities. Method: The authors propose Parity-aware BPE, a fairness-conscious tokenization algorithm that, at each merge step, prioritizes the language with the currently worst compression ratio, jointly optimizing frequency-based likelihood and cross-lingual compression parity. Results: The method achieves a near-optimal trade-off: negligible loss in global compression (<0.3%) alongside a substantial gain in cross-lingual fairness. On multilingual corpora, the standard deviation of per-language token counts drops by 42%, while downstream language-modeling performance is unchanged. To the authors' knowledge, this is the first work to systematically reconcile tokenization fairness and efficiency within the BPE framework.

πŸ“ Abstract
Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently leave lower-resource languages with tokenizations that are disproportionately longer, morphologically implausible, or even riddled with <UNK> placeholders. This phenomenon ultimately amplifies computational and financial inequalities between users from different language backgrounds. To remedy this, we introduce Parity-aware Byte Pair Encoding (BPE), a variant of the widely-used BPE algorithm. At every merge step, Parity-aware BPE maximizes the compression gain of the currently worst-compressed language, trading a small amount of global compression for cross-lingual parity. We find empirically that Parity-aware BPE leads to more equitable token counts across languages, with negligible impact on global compression rate and no substantial effect on language-model performance in downstream tasks.
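The merge rule described in the abstract lends itself to a compact sketch. The following toy implementation is illustrative only, not the authors' code: each merge is the most frequent pair in the language with the worst tokens-per-character ratio, then applied to every language so the vocabulary stays shared. All names (e.g. `parity_aware_bpe`) are hypothetical.

```python
from collections import Counter

def pair_counts(corpus):
    """Count adjacent symbol pairs over a corpus (a list of symbol lists)."""
    counts = Counter()
    for seq in corpus:
        counts.update(zip(seq, seq[1:]))
    return counts

def apply_merge(seq, pair):
    """Rewrite one sequence, replacing each occurrence of `pair` with its concatenation."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def parity_aware_bpe(corpora, num_merges):
    """corpora: {language: list of symbol sequences}, initially characters.
    Each merge maximizes pair frequency in the currently worst-compressed
    language, but is applied to every language (shared vocabulary)."""
    n_chars = {lang: sum(len(sym) for seq in seqs for sym in seq)
               for lang, seqs in corpora.items()}
    merges = []
    for _ in range(num_merges):
        # Tokens per character: higher means worse compression.
        ratio = {lang: sum(len(seq) for seq in seqs) / n_chars[lang]
                 for lang, seqs in corpora.items()}
        worst = max(ratio, key=ratio.get)
        counts = pair_counts(corpora[worst])
        if not counts:
            break
        best_pair = counts.most_common(1)[0][0]
        merges.append(best_pair)
        for lang, seqs in corpora.items():
            corpora[lang] = [apply_merge(seq, best_pair) for seq in seqs]
    return merges
```

Compared with standard BPE, only the selection step changes: pair statistics come from the worst-compressed language instead of the pooled corpus, which is where the small global-compression cost and the parity gain both originate.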
Problem

Research questions and friction points this paper is trying to address.

Improving cross-lingual fairness in tokenization algorithms
Addressing disproportionate token lengths in low-resource languages
Reducing computational inequalities across different language backgrounds
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parity-aware BPE for cross-lingual fairness
Maximizes compression gain for worst-compressed language
Balances token counts without harming performance
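The fairness claim above is stated as a reduction in the standard deviation of per-language token counts. A one-function sketch of such a disparity metric, assuming parallel text tokenized separately per language (the name `token_count_disparity` is hypothetical):

```python
from statistics import pstdev

def token_count_disparity(tokenized):
    """Population std. dev. of total token counts across languages for the
    same parallel content; lower means more equitable tokenization.
    `tokenized` maps language -> list of token sequences."""
    totals = [sum(len(seq) for seq in seqs) for seqs in tokenized.values()]
    return pstdev(totals)
```

For example, a tokenizer that spends 2 tokens on an English sentence but 4 on its Swahili translation scores a disparity of 1.0; perfectly balanced token counts score 0.0.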