VAIYAKARANA : A Benchmark for Automatic Grammar Correction in Bangla

📅 2024-06-20
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Grammatical error correction (GEC) for Bangla (Bengali)—the world's fifth most spoken language—suffers from a lack of high-quality authentic error corpora and standardized evaluation benchmarks. Method: The authors propose a linguistics-driven, controllable error-generation framework built on a systematic taxonomy of five coarse-grained and twelve fine-grained error types. Contribution/Results: Based on this taxonomy, they release Vaiyakarana, a large-scale Bangla GEC dataset with 92,830 grammatically incorrect sentences and 18,426 correct sentences, supplemented by 619 human-generated erroneous sentences collected from essays by native speakers. Evaluations of neural models and LLMs, benchmarked against native-speaker judges, show that even state-of-the-art models remain markedly inferior to native speakers at judging grammatical acceptability. The linguistically grounded methodology is applicable to most other Indian languages.

📝 Abstract
Bangla (Bengali) is the fifth most spoken language globally, yet the problem of automatic grammar correction in Bangla is still in its nascent stage. This is mostly due to the need for a large corpus of grammatically incorrect sentences with their corresponding correct counterparts. The present state-of-the-art techniques to curate a corpus of grammatically wrong sentences involve random swapping, insertion, and deletion of words. However, these steps may not always generate grammatically wrong sentences in Bangla. In this work, we propose a pragmatic approach to generate grammatically wrong sentences in Bangla. We first categorize the different kinds of errors in Bangla into 5 broad classes and 12 finer classes. We then use these to generate grammatically wrong sentences systematically from a correct sentence. This approach can generate a large number of wrong sentences and can, thus, mitigate the challenge of lacking a large corpus for neural networks. We provide a dataset, Vaiyakarana, consisting of 92,830 grammatically incorrect sentences as well as 18,426 correct sentences. We also collected 619 human-generated sentences from essays written by Bangla native speakers. This helped us to understand which errors are more frequent. We evaluated our corpus against neural models and LLMs and also benchmarked it against human evaluators who are native speakers of Bangla. Our analysis shows that native speakers are far more accurate than state-of-the-art models at detecting whether a sentence is grammatically correct. Our methodology for generating erroneous sentences can be applied to most other Indian languages as well.
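The taxonomy-driven generation described in the abstract — pick an error class, then apply a class-specific corruption rule to a correct sentence — can be sketched as below. This is a hypothetical, simplified illustration: the class names (`syntactic`, `omission`) and rules are illustrative stand-ins, not the paper's actual 5-class/12-class taxonomy or its linguistically informed rules.

```python
import random

def swap_adjacent_words(tokens):
    """Illustrative word-order error: swap two adjacent tokens."""
    if len(tokens) < 2:
        return tokens
    i = random.randrange(len(tokens) - 1)
    tokens = tokens[:]
    tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return tokens

def drop_token(tokens):
    """Illustrative omission error: delete one token."""
    if len(tokens) < 2:
        return tokens
    i = random.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]

# Coarse error class -> finer-grained corruption rules (stand-ins for
# the paper's 5 broad / 12 finer classes).
ERROR_TAXONOMY = {
    "syntactic": [swap_adjacent_words],
    "omission": [drop_token],
}

def generate_erroneous(sentence, coarse_class):
    """Apply one rule from the chosen class to a correct sentence,
    yielding a (wrong, correct) training pair."""
    rule = random.choice(ERROR_TAXONOMY[coarse_class])
    tokens = sentence.split()
    return " ".join(rule(tokens)), sentence

wrong, correct = generate_erroneous("আমি ভাত খাই", "syntactic")
```

In contrast to random swap/insert/delete noise applied blindly, conditioning on an error class lets the corpus builder control which error types appear and in what proportion — e.g. matching the frequencies observed in the 619 native-speaker essays.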
Problem

Research questions and friction points this paper is trying to address.

Improving Bangla grammatical error correction
Creating a synthetic dataset for Bangla error correction
Evaluating model and LLM performance on Bangla grammar tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Categorize Bangla errors into 5 broad and 12 finer classes
Generate erroneous sentences via linguistically informed error injection
Benchmark neural models, LLMs, and native speakers on Bangla GEC
Pramit Bhattacharyya
Ph.D. Scholar, IIT Kanpur
Knowledge Graph · NLP · Data Mining · Semantic Web
Arnab Bhattacharya
Dept. of Computer Science and Engineering, Indian Institute of Technology Kanpur, India