🤖 AI Summary
Low-resource languages (LRLs) face dual challenges in spelling correction: severe data scarcity and inadequate model adaptation. This work presents the first systematic evaluation of diverse pretrained language models, including large language models (LLMs), encoder-only, and encoder-decoder architectures, on LRL spelling correction. We propose a hallucination-resistant, fine-grained evaluation framework and open-source LMSpell, a multilingual, model-agnostic spelling-checker toolkit. Methodologically, we combine few-shot fine-tuning with linguistically informed prompt engineering, validating effectiveness even on zero-resource languages (e.g., Sinhala) without prior monolingual pretraining. Experiments demonstrate that LLMs substantially outperform conventional pretrained language models (PLMs) and generalize across languages, even in truly zero-resource settings. By enabling rigorous, model-agnostic comparison and robust evaluation, this study bridges critical gaps in LRL spelling correction and provides a reusable methodological framework and foundational infrastructure for low-resource NLP.
📝 Abstract
Spell correction is still a challenging problem for low-resource languages (LRLs). While pretrained language models (PLMs) have been employed for spell correction, their use is limited to a handful of languages, and there has been no proper comparison across PLMs. We present the first empirical study on the effectiveness of PLMs for spell correction that includes LRLs. We find that Large Language Models (LLMs) outperform their counterparts (encoder-only and encoder-decoder models) when the fine-tuning dataset is large. This observation holds even for languages on which the LLM was not pre-trained. We release LMSpell, an easy-to-use spell correction toolkit that works across PLMs. It includes an evaluation function that compensates for the hallucination of LLMs. Further, we present a case study on Sinhala to shed light on the plight of spell correction for LRLs.
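The abstract mentions an evaluation function that compensates for LLM hallucination, but does not specify it. The sketch below is a hypothetical illustration (not LMSpell's actual API; the function name `evaluate_correction` is invented) of one way such compensation could work: predicted tokens that do not align with the reference are counted as hallucinations and pulled into the score's denominator, so a model cannot improve its score by rewriting or appending text.

```python
# Hypothetical sketch of hallucination-aware spell-correction scoring.
# This is an illustration of the general idea, not LMSpell's implementation.
from difflib import SequenceMatcher

def evaluate_correction(prediction: str, reference: str) -> dict:
    """Token-level accuracy that discounts hallucinated output.

    Predicted tokens that do not align with the reference (insertions,
    spurious rewrites) count against the score instead of being ignored.
    """
    pred, ref = prediction.split(), reference.split()
    # Align the prediction against the reference; unmatched predicted
    # tokens are treated as hallucinations.
    matcher = SequenceMatcher(a=ref, b=pred)
    correct = sum(block.size for block in matcher.get_matching_blocks())
    hallucinated = len(pred) - correct
    total = max(len(ref), len(pred))
    return {
        "accuracy": correct / total if total else 1.0,
        "hallucinated_tokens": hallucinated,
    }
```

For example, a faithful correction of "teh cat sat" to "the cat sat" scores 1.0, while an output that appends extra invented words ("the cat sat on the mat") is penalized for each unaligned token rather than treated as a perfect match on its aligned prefix.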