🤖 AI Summary
It remains unclear whether subword tokenization strategies for non-Latin scripts—such as Nepali—merely optimize perplexity or genuinely enhance downstream language understanding.
Method: We conduct a systematic empirical study using a unified small autoregressive Transformer architecture, evaluating six mainstream tokenization methods—including byte-level BPE and SentencePiece—via fine-tuning on multiple Nepali downstream tasks.
Contribution/Results: SentencePiece substantially outperforms byte-level approaches; critically, perplexity reduction does not reliably correlate with downstream performance gains, revealing a non-monotonic relationship between token granularity and linguistic understanding. This work presents the first comprehensive, multi-method tokenization analysis for Nepali, challenging the validity of perplexity as a universal proxy metric and providing empirically grounded guidance for tokenizer selection in low-resource, non-Latin languages.
📝 Abstract
Recent language models use subword mechanisms to handle out-of-vocabulary (OOV) words at test time, and their generation capacity is generally measured using perplexity, an intrinsic metric. It is known that increasing subword granularity decreases the perplexity value. However, studies of how subwording affects the understanding capacity of language models have been few, and limited to a handful of languages. To reduce this gap, we use six different tokenization schemes to pretrain relatively small language models in Nepali and use the learned representations to finetune on several downstream tasks. Although the byte-level BPE algorithm has been used in recent models such as GPT and RoBERTa, we show that, on average, it is sub-optimal compared to algorithms such as SentencePiece in finetuning performance for Nepali. Additionally, similar recent studies have focused on BERT-based language models; we, however, pretrain and finetune sequential transformer-based language models.
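One intuition behind the byte-level result (a sketch of ours, not an analysis from the paper): byte-level tokenizers start from UTF-8 bytes, and every Devanagari code point occupies three bytes, so base sequences for Nepali text are roughly three times longer than their character count before any merges are learned. A minimal stdlib-only illustration:

```python
# Hypothetical illustration: byte-level tokenization starts from UTF-8 bytes,
# which inflates base sequence length for Devanagari script (used by Nepali).
word = "नेपाली"  # "Nepali" written in Devanagari

num_chars = len(word)                  # code points (characters)
num_bytes = len(word.encode("utf-8"))  # UTF-8 bytes, the byte-level base units

# Each Devanagari code point (U+0900-U+097F) encodes to 3 bytes in UTF-8,
# so a byte-level model sees a sequence 3x longer than the character sequence.
print(num_chars, num_bytes)  # 6 18
```

Character-aware schemes such as SentencePiece operate on code points (or learned subwords over them) directly, which may partly explain the gap the paper reports, though the abstract itself does not attribute the result to this mechanism.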