From Characters to Tokens: Dynamic Grouping with Hierarchical BPE

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing subword tokenization methods (e.g., BPE) model rare words inefficiently and require excessively large vocabularies, while character-level models, though robust, introduce computational bottlenecks in Transformer architectures. To address this, we propose a **language-agnostic dynamic character grouping method**, the first to directly leverage BPE segmentation structure for two-stage hierarchical compression: (1) dynamically aggregating characters into blocks along BPE boundaries, with explicit block-end tokens inserted; and (2) applying lightweight second-level BPE compression over the resulting block sequence. Our approach requires no whitespace assumptions or auxiliary models, preserving character-level generalization while achieving subword-level efficiency. Experiments across multilingual benchmarks show performance competitive with or superior to entropy- and whitespace-driven baselines, alongside substantial reductions in vocabulary size (−38% on average) and sequence length (−29% on average), thereby lowering computational overhead.

📝 Abstract
Subword tokenization methods like Byte Pair Encoding (BPE) are widely used in large language models due to their balance of vocabulary compactness and representational power. However, they suffer from inefficiencies in representing rare words and require large embedding matrices. Character-level models address these issues but introduce performance bottlenecks, particularly in Transformer-based architectures. Recent hierarchical models attempt to merge the benefits of both paradigms by grouping characters into patches, but existing patching strategies either rely on whitespace, limiting applicability to certain languages, or require auxiliary models that introduce new dependencies. In this paper, we propose a dynamic character grouping method that leverages the structure of existing BPE tokenization without requiring additional models. By appending explicit end-of-patch markers to BPE tokens and introducing a second-level BPE compression stage to control patch granularity, our method offers efficient, flexible, and language-agnostic representations. Empirical results demonstrate that our approach matches or exceeds the performance of dynamic entropy- and whitespace-based patching strategies, while maintaining a compact vocabulary.
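The two-stage scheme the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the paper's exact procedure: the `EOP` marker string, the example first-stage BPE output, and the single greedy merge standing in for the second-level BPE pass are all assumptions introduced for the sketch.

```python
from collections import Counter

EOP = "</p>"  # end-of-patch marker (name assumed for illustration)

def to_char_patches(bpe_tokens):
    """Stage 1: expand each BPE token into its characters and append
    an explicit end-of-patch marker after every token."""
    symbols = []
    for tok in bpe_tokens:
        symbols.extend(tok)
        symbols.append(EOP)
    return symbols

def merge_most_frequent_pair(symbols):
    """Stage 2 (one step): merge the most frequent adjacent pair of
    symbols, mimicking a single iteration of a lightweight
    second-level BPE pass over the patch stream."""
    pairs = Counter(zip(symbols, symbols[1:]))
    if not pairs:
        return symbols
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
            merged.append(a + b)  # fuse the pair into one coarser symbol
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Example: pretend a first-stage BPE tokenizer segmented "unhappiness".
bpe_tokens = ["un", "happi", "ness"]
stream = to_char_patches(bpe_tokens)          # characters + EOP markers
compressed = merge_most_frequent_pair(stream)  # one second-level merge
```

In a full implementation the second stage would iterate until a target vocabulary size or patch granularity is reached; one merge step is shown here only to make the mechanism concrete.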
Problem

Research questions and friction points this paper is trying to address.

BPE tokenization inefficiently represents rare words
Character models cause performance bottlenecks in Transformers
Existing patching strategies have language or dependency limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic character grouping using BPE structure
Appending end-of-patch markers to tokens
Second-level BPE compression controls granularity