Pretraining Language Models with Subword Regularization: An Empirical Study of BPE Dropout in Low-Resource NLP

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the potential mismatch in subword segmentation between pretraining and fine-tuning in low-resource settings by investigating BPE dropout, a stochastic subword segmentation strategy, applied during pretraining. The authors train monolingual and bilingual BERT models on downsampled subsets of English, German, French, Spanish, Kiswahili, and isiXhosa and evaluate them on XNLI, PAWS-X, PAN-X, and MasakhaNER 2.0. Across tasks, the best results are typically obtained when BPE dropout is applied during both pretraining and fine-tuning, with the largest gains when either pretraining or fine-tuning data is scarce; applying BPE dropout only during fine-tuning can underperform deterministic segmentation in smaller-data settings. An analysis of morphological boundary alignment finds only modest improvements in expected alignment under BPE dropout, with better-aligned segmentations remaining rare, which suggests that pretraining-time exposure to such segmentations may contribute to the downstream benefits.

📝 Abstract

Subword regularization methods such as BPE dropout are typically applied only during fine-tuning, while pretraining is usually done with deterministic tokenization. This creates a potential segmentation mismatch between pretraining and fine-tuning. We investigate whether applying BPE dropout during pretraining improves downstream performance in low-resource NLP. We train monolingual and bilingual BERT models on downsampled subsets of English, German, French, Spanish, Kiswahili, and isiXhosa, and evaluate them on XNLI, PAWS-X, PAN-X, and MasakhaNER 2.0. Across tasks, the best results are typically obtained when stochastic tokenization is applied during both pretraining and fine-tuning, whereas applying BPE dropout only during fine-tuning can underperform deterministic tokenization in smaller-data settings. This disadvantage diminishes as fine-tuning data increases, while the benefits of pretraining-time BPE dropout are largest when either pretraining or fine-tuning data is scarce. The benefits of BPE dropout are often attributed to better compositional representations, especially for rare words. To examine this, we measure morphological boundary alignment under BPE dropout and find only modest improvements in expected alignment, while better-aligned segmentations remain rare. This suggests that fine-tuning alone may provide limited exposure to such segmentations, whereas stochastic tokenization during pretraining exposes the model to them more consistently. We further show that selectively introducing morphologically aligned segmentations during fine-tuning improves performance mainly for models pretrained without BPE dropout. Overall, these findings suggest that exposure to better-aligned segmentations may contribute to the downstream benefits of applying BPE dropout during pretraining.
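
To make the contrast between deterministic and stochastic tokenization concrete, the following is a minimal sketch of BPE dropout as it is commonly described: at each merge step, every candidate merge is skipped with probability p, and the highest-priority surviving merge is applied. It is not the authors' implementation. The merge table `TOY_MERGES`, the function name `bpe_dropout_segment`, and the chosen dropout rate are hypothetical and serve only as illustration.

```python
import random

# Hypothetical toy merge table, ordered by priority (earlier = learned first).
TOY_MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]


def bpe_dropout_segment(word, merges, p=0.1, rng=None):
    """Segment `word` with ranked BPE merges, skipping each candidate
    merge with probability `p` at every step (p=0 is plain BPE)."""
    rng = rng or random.Random()
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    tokens = list(word)  # start from single characters
    while True:
        # Adjacent pairs that are in the merge table and survive dropout.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in ranks and rng.random() >= p
        ]
        if not candidates:
            return tokens  # nothing left to merge at this step
        # Apply the highest-priority (lowest-rank) surviving merge.
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]


if __name__ == "__main__":
    # Deterministic tokenization: always the same segmentation.
    print(bpe_dropout_segment("lower", TOY_MERGES, p=0.0))
    # Stochastic tokenization: segmentations vary from call to call.
    rng = random.Random(0)
    for _ in range(5):
        print(bpe_dropout_segment("lower", TOY_MERGES, p=0.3, rng=rng))
```

With p = 0 the function reduces to deterministic BPE, and with p = 1 it falls back to characters. In this sketch, applying dropout during pretraining would mean calling the stochastic variant each time a training example is tokenized, so the model sees many segmentations of the same word rather than a single fixed one.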

Problem

Research questions and friction points this paper is trying to address.

subword regularization

BPE dropout

low-resource NLP

pretraining

tokenization mismatch

Innovation

Methods, ideas, or system contributions that make the work stand out.

BPE dropout

subword regularization

low-resource NLP