Neural Grammatical Error Correction for Romanian

📅 2026-04-26
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses the scarcity of grammatical error correction (GEC) resources for Romanian by introducing a first Romanian GEC corpus of 10,000 sentence pairs and adapting the German version of the ERRANT scorer to Romanian for edit extraction and evaluation. The authors also propose a method for generating artificial training data that requires only a part-of-speech tagger, which makes it easy to extend to other languages. The baseline, a small Transformer trained only on the GEC corpus, reaches an F0.5 of 44.38; the best model, a larger Transformer pretrained on the artificially generated data and then fine-tuned on the actual corpus, reaches an F0.5 of 53.76. These results indicate that pretraining on synthetic data is effective for GEC in low-resource settings.
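
For reference, F0.5 is the F-beta score with beta = 0.5, which weights precision more heavily than recall. The sketch below computes it from edit-level counts. The function name `f_beta`, the example counts, and the choice to return 0.0 when precision and recall are both zero are assumptions made for illustration; they are not taken from the paper or from ERRANT.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta score from edit counts: true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0  # assumed convention for the degenerate case
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 6 correct edits, 2 spurious edits, 6 missed edits.
# Precision is 0.75 and recall is 0.5.
score = f_beta(tp=6, fp=2, fn=6)
print(round(score, 4))
```

With beta below 1, a system that proposes few but reliable corrections scores higher than one that proposes many corrections of mixed quality, which is why F0.5 is the customary GEC metric.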

📝 Abstract
Resources for Grammatical Error Correction (GEC) in non-English languages are scarce, while available spellcheckers in these languages are mostly limited to simple corrections and rules. In this paper we introduce a first GEC corpus for Romanian consisting of 10k pairs of sentences. In addition, the German version of the ERRANT (ERRor ANnotation Toolkit) scorer was adapted for Romanian to analyze this corpus and extract the edits needed for evaluation. We experimented with multiple neural models, together with pretraining strategies, which proved effective for GEC in low-resource settings. Our baseline consists of a small Transformer model trained only on the GEC dataset (F0.5 of 44.38), whereas the best performing model is produced by pretraining a larger Transformer model on artificially generated data, followed by finetuning on the actual corpus (F0.5 of 53.76). The proposed method for generating additional training examples is easily extensible and can be applied to any language, as it requires only a POS tagger.
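
The abstract does not say which corruption operations the authors apply, only that the generation method needs a POS tagger. As a rough illustration of the general idea, the sketch below turns a correct, already-tagged sentence into a noisy one by dropping some closed-class words and swapping other words for a different word with the same tag. Everything here is an assumption: the function `corrupt`, the `DROPPABLE` tag set, the probability `p`, and the toy vocabulary are hypothetical, and the tags are supplied by hand rather than by a real tagger.

```python
import random

# Assumed tags whose words may be deleted (adpositions, determiners).
DROPPABLE = {"ADP", "DET"}


def corrupt(tagged, vocab_by_pos, rng, p=0.3):
    """Return a noisy token list built from (word, pos) pairs.

    With probability p per token: delete it if its tag is droppable,
    otherwise replace it with a different word that shares its tag.
    """
    noisy = []
    for word, pos in tagged:
        if rng.random() < p:
            if pos in DROPPABLE:
                continue  # deletion error
            candidates = sorted(set(vocab_by_pos.get(pos, [])) - {word})
            if candidates:
                noisy.append(rng.choice(candidates))  # same-POS substitution
                continue
        noisy.append(word)
    return noisy


# Toy Romanian example ("I go to the park"); tags are written by hand.
sentence = [("Eu", "PRON"), ("merg", "VERB"), ("la", "ADP"), ("parc", "NOUN")]
vocab = {"VERB": ["merg", "mergi", "merge"], "NOUN": ["parc", "parcuri"]}

source = corrupt(sentence, vocab, random.Random(0), p=0.5)
target = [word for word, _ in sentence]
print(source, "->", target)
```

Each (noisy source, clean target) pair produced this way could serve as one artificial training example for pretraining, before finetuning on the real corpus.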
Problem

Research questions and friction points this paper is trying to address.

Grammatical Error Correction
low-resource languages
Romanian
GEC corpus
neural models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Grammatical Error Correction
low-resource NLP
artificial data generation
Transformer pretraining
ERRANT adaptation