LMSpell: Neural Spell Checking for Low-Resource Languages

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Low-resource languages (LRLs) face dual challenges in spelling correction: severe data scarcity and inadequate model adaptation. This work presents the first systematic evaluation of diverse pretrained language models—including large language models (LLMs), encoder-only, and encoder-decoder architectures—on LRL spelling correction. We propose a hallucination-resistant, fine-grained evaluation framework and open-source LMSpell, a multilingual, model-agnostic spelling checker toolkit. Methodologically, we combine few-shot fine-tuning with linguistically informed prompt engineering, validating effectiveness even on zero-resource languages (e.g., Sinhala) without prior monolingual pretraining. Experiments demonstrate that LLMs substantially outperform conventional pretrained language models (PLMs), exhibiting strong cross-lingual generalization—even in truly zero-resource settings. This study bridges critical gaps in LRL spelling correction by enabling rigorous, model-agnostic comparative analysis and robust evaluation, thereby providing a reusable methodological framework and foundational infrastructure for low-resource NLP.

📝 Abstract
Spell correction remains a challenging problem for low-resource languages (LRLs). While pretrained language models (PLMs) have been employed for spell correction, their use is still limited to a handful of languages, and there has been no proper comparison across PLMs. We present the first empirical study on the effectiveness of PLMs for spell correction that includes LRLs. We find that Large Language Models (LLMs) outperform their counterparts (encoder-based and encoder-decoder) when the fine-tuning dataset is large. This observation holds even for languages on which the LLM was not pre-trained. We release LMSpell, an easy-to-use spell correction toolkit that works across PLMs. It includes an evaluation function that compensates for the hallucination of LLMs. Further, we present a case study with Sinhala to shed light on the plight of spell correction for LRLs.
Problem

Research questions and friction points this paper is trying to address.

Spell correction for low-resource languages remains challenging due to data scarcity
Pretrained language models have not been systematically compared across languages for this task
Whether large language models outperform other architectures given sufficient fine-tuning data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models outperform encoder-based and encoder-decoder pretrained models
Toolkit includes an evaluation function that compensates for LLM hallucinations
Empirical study covers low-resource languages absent from the models' pretraining data
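The hallucination-compensating evaluation described above can be sketched as a token-level scorer. This is a minimal illustration under assumed conventions, not LMSpell's actual interface: `evaluate_corrections` is a hypothetical name, and using token-count preservation as the hallucination signal is an assumption (a spelling fix should not add or drop words, whereas a hallucinating LLM often does).

```python
def evaluate_corrections(sources, predictions, references):
    """Score spell-correction outputs while discounting hallucinations.

    Hypothetical sketch (not LMSpell's real API): a spelling correction
    should preserve the source's token count, so a prediction whose
    length diverges from the source is flagged as hallucinated and
    scored as entirely wrong rather than inflating token accuracy.
    """
    correct = total = hallucinated = 0
    for src, pred, ref in zip(sources, predictions, references):
        src_toks, pred_toks, ref_toks = src.split(), pred.split(), ref.split()
        if len(pred_toks) != len(src_toks):
            # Length mismatch: the model added or dropped content, a
            # common LLM failure mode; count every reference token wrong.
            hallucinated += 1
            total += len(ref_toks)
            continue
        for p_tok, r_tok in zip(pred_toks, ref_toks):
            total += 1
            correct += p_tok == r_tok
    accuracy = correct / total if total else 0.0
    hallucination_rate = hallucinated / len(sources) if sources else 0.0
    return accuracy, hallucination_rate
```

Reporting the hallucination rate separately keeps accuracy comparable between LLMs (which may rewrite freely) and encoder-based correctors (which cannot change the token count).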
Akesh Gunathilake, Nadil Karunarathne, Tharusha Bandaranayake, Nisansa de Silva
Department of Computer Science and Engineering, University of Moratuwa, Katubedda, 10400, Sri Lanka
Surangika Ranathunga
Senior Lecturer, School of Mathematical and Computational Sciences, Massey University, New Zealand
Natural Language Processing · Machine Learning · Large Language Models