🤖 AI Summary
Low-resource languages (LRLs) face dual challenges in spelling correction: severe data scarcity and inadequate model adaptation. This work presents the first systematic evaluation of diverse pretrained language models, including large language models (LLMs), encoder-only, and encoder-decoder architectures, on LRL spelling correction. We propose a hallucination-resistant, fine-grained evaluation framework and open-source LMSpell, a multilingual, model-agnostic spelling-checker toolkit. Methodologically, we combine few-shot fine-tuning with linguistically informed prompt engineering, validating effectiveness even on zero-resource languages (e.g., Sinhala) without prior monolingual pretraining. Experiments demonstrate that LLMs substantially outperform conventional pretrained language models (PLMs) and generalize across languages, even in truly zero-resource settings. By enabling rigorous, model-agnostic comparison and robust evaluation, this study bridges critical gaps in LRL spelling correction and provides a reusable methodological framework and foundational infrastructure for low-resource NLP.
📝 Abstract
Spell correction is still a challenging problem for low-resource languages (LRLs). While pretrained language models (PLMs) have been employed for spell correction, their use is limited to a handful of languages, and there has been no proper comparison across PLMs. We present the first empirical study on the effectiveness of PLMs for spell correction that includes LRLs. We find that Large Language Models (LLMs) outperform their counterparts (encoder-only and encoder-decoder models) when the fine-tuning dataset is large. This observation holds even for languages on which the LLM was not pre-trained. We release LMSpell, an easy-to-use spell correction toolkit that works across PLMs. It includes an evaluation function that compensates for the hallucination of LLMs. Further, we present a case study on Sinhala to shed light on the plight of spell correction for LRLs.
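The abstract mentions an evaluation function that compensates for LLM hallucination, but does not specify it. The sketch below is a hypothetical illustration (not LMSpell's actual API; the function name `evaluate_correction` is invented) of one way such compensation could work: predicted tokens that do not align with the reference are counted as hallucinations and pulled into the score's denominator, so a model cannot improve its score by rewriting or appending text.

```python
# Hypothetical sketch of hallucination-aware spell-correction scoring.
# This is an illustration of the general idea, not LMSpell's implementation.
from difflib import SequenceMatcher

def evaluate_correction(prediction: str, reference: str) -> dict:
    """Token-level accuracy that discounts hallucinated output.

    Predicted tokens that do not align with the reference (insertions,
    spurious rewrites) count against the score instead of being ignored.
    """
    pred, ref = prediction.split(), reference.split()
    # Align the prediction against the reference; unmatched predicted
    # tokens are treated as hallucinations.
    matcher = SequenceMatcher(a=ref, b=pred)
    correct = sum(block.size for block in matcher.get_matching_blocks())
    hallucinated = len(pred) - correct
    total = max(len(ref), len(pred))
    return {
        "accuracy": correct / total if total else 1.0,
        "hallucinated_tokens": hallucinated,
    }
```

For example, a faithful correction of "teh cat sat" to "the cat sat" scores 1.0, while an output that appends extra invented words ("the cat sat on the mat") is penalized for each unaligned token rather than treated as a perfect match on its aligned prefix.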