Can Character-based Language Models Improve Downstream Task Performances In Low-Resource And Noisy Language Scenarios?

📅 2021-10-26

🏛️ WNUT

📈 Citations: 15

✨ Influential: 0

🤖 AI Summary

Problem: Low-resource, highly variable nonstandard languages, such as North African dialectal Arabic written in Latin script (NArabizi), pose significant challenges for part-of-speech tagging and dependency parsing because of orthographic inconsistency and scarce annotated data. Method: The authors train a character-based language model on only 99k NArabizi sentences, fine-tune it on a small treebank, and compare it with monolingual and multilingual models on both tasks. Contribution/Results: The character-based model reaches performance close to that obtained with large multilingual and monolingual pretrained models, rather than outperforming them. Experiments on a much larger data set of noisy French user-generated content confirm the result, and the authors argue that character-based language models can be an asset for NLP in low-resource, high-variability settings.

📝 Abstract

Recent impressive improvements in NLP, largely based on the success of contextual neural language models, have been mostly demonstrated on at most a couple dozen high-resource languages. Building language models and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi, found mostly on social media and messaging communication. In this low-resource scenario with data displaying a high level of variability, we compare the downstream performance of a character-based language model on part-of-speech tagging and dependency parsing to that of monolingual and multilingual models. We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank of this language leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual models. Confirming these results on a much larger data set of noisy French user-generated content, we argue that such character-based language models can be an asset for NLP in low-resource and high language variability settings.

Problem

Research questions and friction points this paper is trying to address.

Improving NLP for low-resource, noisy languages

Evaluating character-based models for dialectal Arabic

Comparing performance on POS tagging and parsing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Character-based model for low-resource languages

Training on 99k NArabizi sentences

Fine-tuning on small treebank data
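
To make the intuition behind character-based modeling concrete, here is a minimal sketch in Python. It is not the paper's model: it only contrasts a closed word-level vocabulary lookup with character trigram overlap on a few spelling variants. The variants, the helper names `char_ngrams` and `jaccard`, and the choice of trigrams are all assumptions made for illustration and are not taken from the paper or its data.

```python
from typing import List, Set


def char_ngrams(word: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a word, with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical spelling variants of one word (illustrative only).
variants: List[str] = ["mabrouk", "mabrok", "mabrouck"]

# Word-level view: a vocabulary built from one spelling misses the others.
vocab = {"mabrouk"}
word_level_hits = [w in vocab for w in variants]

# Character-level view: the variants still share several trigrams
# with the reference spelling, so they remain partially comparable.
reference = char_ngrams(variants[0])
char_level_sims = [jaccard(reference, char_ngrams(w)) for w in variants]

for w, hit, sim in zip(variants, word_level_hits, char_level_sims):
    print(f"{w:10s} in_vocab={hit!s:5s} trigram_jaccard={sim:.2f}")
```

In this toy setting, the unseen spellings are out of vocabulary at the word level but keep a nonzero character-level overlap with the known form. That is one plausible reason character-based representations may cope better with the high orthographic variability described above.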

🔎 Similar Papers

Small Language Models Also Work With Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas