🤖 AI Summary
It remains unclear whether subword tokenization strategies for non-Latin scripts—such as Nepali—merely optimize perplexity or genuinely enhance downstream language understanding.
Method: We conduct a systematic empirical study using a unified small autoregressive Transformer architecture, evaluating six mainstream tokenization methods—including byte-level BPE and SentencePiece—via fine-tuning on multiple Nepali downstream tasks.
Contribution/Results: SentencePiece substantially outperforms byte-level approaches; critically, perplexity reduction does not reliably correlate with downstream performance gains, revealing a non-monotonic relationship between token granularity and linguistic understanding. This work presents the first comprehensive, multi-method tokenization analysis for Nepali, challenging the validity of perplexity as a universal proxy metric and providing empirically grounded guidance for tokenizer selection in low-resource, non-Latin languages.
📝 Abstract
Recent language models use subword mechanisms to handle out-of-vocabulary (OOV) words at test time, and their generation capacity is generally measured using perplexity, an intrinsic metric. It is known that increasing subword granularity decreases the perplexity value. However, studies of how subwording affects the understanding capacity of language models have been few, and limited to a handful of languages. To reduce this gap, we use six different tokenization schemes to pretrain relatively small language models in Nepali and use the learned representations to finetune on several downstream tasks. Although the byte-level BPE algorithm has been used in recent models such as GPT and RoBERTa, we show that, on average, it is sub-optimal compared to algorithms such as SentencePiece in finetuning performance for Nepali. Additionally, similar recent studies have focused on BERT-based language models; we, however, pretrain and finetune sequential transformer-based language models.
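One intuition behind the byte-level result (a sketch of ours, not an analysis from the paper): byte-level tokenizers start from UTF-8 bytes, and every Devanagari code point occupies three bytes, so base sequences for Nepali text are roughly three times longer than their character count before any merges are learned. A minimal stdlib-only illustration:

```python
# Hypothetical illustration: byte-level tokenization starts from UTF-8 bytes,
# which inflates base sequence length for Devanagari script (used by Nepali).
word = "नेपाली"  # "Nepali" written in Devanagari

num_chars = len(word)                  # code points (characters)
num_bytes = len(word.encode("utf-8"))  # UTF-8 bytes, the byte-level base units

# Each Devanagari code point (U+0900-U+097F) encodes to 3 bytes in UTF-8,
# so a byte-level model sees a sequence 3x longer than the character sequence.
print(num_chars, num_bytes)  # 6 18
```

Character-aware schemes such as SentencePiece operate on code points (or learned subwords over them) directly, which may partly explain the gap the paper reports, though the abstract itself does not attribute the result to this mechanism.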