🤖 AI Summary
Standard Byte-Pair Encoding (BPE) segments whitespace-free languages such as Chinese poorly, because its frequency-driven merges ignore linguistic boundaries. To address this, we propose two unsupervised, entropy-informed pre-tokenization strategies. The first combines pointwise mutual information (PMI) with character-level left and right adjacency entropy to identify coherent character spans; the second uses predictive entropy from a pretrained GPT-2 model to detect boundary uncertainty. The detected boundaries then guide where BPE merges may occur. Neither strategy requires manual annotation or hand-written linguistic rules, which makes the approach a candidate for low-resource and multilingual settings. Evaluated on a subset of the PKU word segmentation dataset, both strategies substantially improve precision, recall, and F1 score over standard BPE and align more closely with gold-standard word segmentation.
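To make the predictive-entropy idea concrete, here is a minimal sketch of entropy-based boundary detection. It is not the authors' implementation: a toy character bigram model stands in for GPT-2 (which would condition on the full prefix rather than on one character), and the names `train_bigram`, `predictive_entropy`, `pre_tokenize`, and the fixed `threshold` rule are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Toy next-character model (stand-in for GPT-2): counts of b following a."""
    counts = defaultdict(Counter)
    for sent in corpus:
        for a, b in zip(sent, sent[1:]):
            counts[a][b] += 1
    return counts

def predictive_entropy(counts, ch):
    """Shannon entropy (bits) of the next-character distribution after ch."""
    nxt = counts.get(ch, {})
    total = sum(nxt.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in nxt.values())

def pre_tokenize(text, counts, threshold=0.5):
    """Cut after a character when the model is uncertain about what follows.

    The fixed threshold is a hypothetical decision rule, not taken from the paper.
    """
    spans, start = [], 0
    for i in range(len(text) - 1):
        if predictive_entropy(counts, text[i]) > threshold:
            spans.append(text[start:i + 1])
            start = i + 1
    spans.append(text[start:])
    return spans

corpus = ["北京很大", "北京是城市", "北京有公园", "城市很大", "公园很大"]
model = train_bigram(corpus)
print(pre_tokenize("北京很大", model))  # ['北京', '很大']
```

In this toy corpus the character after 京 varies (很, 是, 有), so entropy spikes there and a boundary is placed after 北京. BPE would then be run within each span rather than across the cut.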
📝 Abstract
Byte-Pair Encoding (BPE) has become a widely adopted subword tokenization method in modern language models due to its simplicity and strong empirical performance across downstream tasks. However, applying BPE to unsegmented languages such as Chinese presents significant challenges, as its frequency-driven merge operation is agnostic to linguistic boundaries. To address this, we propose two entropy-informed pre-tokenization strategies that guide BPE segmentation using unsupervised information-theoretic cues. The first approach uses pointwise mutual information and left/right entropy to identify coherent character spans, while the second leverages predictive entropy derived from a pretrained GPT-2 model to detect boundary uncertainty. We evaluate both methods on a subset of the PKU dataset and demonstrate substantial improvements in segmentation precision, recall, and F1 score compared to standard BPE. Our results suggest that entropy-guided pre-tokenization not only enhances alignment with gold-standard linguistic units but also offers a promising direction for improving tokenization quality in low-resource and multilingual settings.
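The statistics behind the first strategy can be sketched as follows. This is an illustrative computation under stated assumptions, not the paper's code: it scores only two-character spans, estimates probabilities from raw counts without smoothing, and the function names `entropy` and `bigram_statistics` are hypothetical. PMI is computed as log2(p(xy) / (p(x) p(y))), and left/right entropy is the Shannon entropy of the characters observed immediately before and after each span.

```python
import math
from collections import Counter, defaultdict

def entropy(counter):
    """Shannon entropy (bits) of a count distribution; 0.0 if empty."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())

def bigram_statistics(corpus):
    """Return PMI and left/right adjacency entropy for every character bigram."""
    unigrams, bigrams = Counter(), Counter()
    left_ctx, right_ctx = defaultdict(Counter), defaultdict(Counter)
    for sent in corpus:
        unigrams.update(sent)
        for i in range(len(sent) - 1):
            bg = sent[i:i + 2]
            bigrams[bg] += 1
            if i > 0:                      # character preceding the bigram
                left_ctx[bg][sent[i - 1]] += 1
            if i + 2 < len(sent):          # character following the bigram
                right_ctx[bg][sent[i + 2]] += 1
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    stats = {}
    for bg, count in bigrams.items():
        p_xy = count / n_bi
        p_x, p_y = unigrams[bg[0]] / n_uni, unigrams[bg[1]] / n_uni
        stats[bg] = {
            "pmi": math.log2(p_xy / (p_x * p_y)),
            "left_entropy": entropy(left_ctx[bg]),
            "right_entropy": entropy(right_ctx[bg]),
        }
    return stats

corpus = ["我爱北京", "他爱北京", "北京很大", "北京很美"]
for span, s in bigram_statistics(corpus).items():
    print(span, s)
```

A common heuristic in unsupervised word discovery treats a span with high PMI (strong internal cohesion) and high entropy on both sides (free external context) as a likely word. How the paper combines these signals into a boundary decision is not specified in the abstract, so any thresholding on top of these statistics would be a further assumption.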