🤖 AI Summary
This study addresses the scarcity of grammatical error correction (GEC) resources for Romanian by constructing a first Romanian GEC corpus of 10,000 sentence pairs and adapting the German version of the ERRANT toolkit to Romanian for edit extraction and evaluation. The authors propose a method for generating synthetic training data that requires only a part-of-speech tagger, which makes it straightforward to extend to other languages. They experiment with several Transformer models and pretraining strategies. The baseline, a small Transformer trained only on the GEC corpus, reaches an F0.5 of 44.38; the best model, a larger Transformer pretrained on the artificially generated data and then fine-tuned on the corpus, reaches an F0.5 of 53.76. The results indicate that pretraining on synthetic data is effective for GEC in low-resource settings.
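For readers unfamiliar with the metric, the F0.5 scores reported above weight precision twice as heavily as recall. The sketch below shows the standard F-beta computation from edit counts. The function name and the example counts are illustrative assumptions, not values or code from the paper.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Standard F-beta score from counts of edits.

    tp: proposed edits that match a reference edit
    fp: proposed edits with no matching reference edit
    fn: reference edits the system failed to propose
    """
    if tp == 0:
        return 0.0  # no correct edits: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 6 correct edits, 2 spurious edits, 6 missed edits.
score = f_beta(tp=6, fp=2, fn=6)
print(f"{100 * score:.2f}")
```

With beta below 1, a system that proposes fewer but more reliable corrections scores higher than one that proposes many uncertain ones, which is the usual motivation for using F0.5 in GEC.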
📝 Abstract
Resources for Grammatical Error Correction (GEC) in non-English languages are scarce, while available spellcheckers in these languages are mostly limited to simple corrections and rules. In this paper we introduce a first GEC corpus for Romanian consisting of 10k pairs of sentences. In addition, the German version of the ERRANT (ERRor ANnotation Toolkit) scorer was adapted for Romanian to analyze this corpus and extract the edits needed for evaluation. We experimented with multiple neural models, together with pretraining strategies, which proved effective for GEC in low-resource settings. Our baseline consists of a small Transformer model trained only on the GEC dataset (F0.5 of 44.38), whereas the best performing model is produced by pretraining a larger Transformer model on artificially generated data, followed by finetuning on the actual corpus (F0.5 of 53.76). The proposed method for generating additional training examples is easily extensible and can be applied to any language, as it requires only a POS tagger.
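The abstract does not spell out how the artificial examples are generated, only that a POS tagger is the sole requirement. As a rough illustration of what POS-driven noise injection can look like, the sketch below corrupts a pre-tagged sentence to form a (noisy source, clean target) pair. The rule sets `DROPPABLE` and `SWAPPABLE`, the function names, and the probability `p` are all hypothetical and are not taken from the paper; the tags follow the Universal POS convention, and tagging itself is assumed to happen upstream.

```python
import random

# Hypothetical rule table (an assumption, not the paper's rules):
# which corruption may apply to which Universal POS tag.
DROPPABLE = {"ADP", "DET", "PRON"}  # function words that may be omitted
SWAPPABLE = {"NOUN", "ADJ"}         # adjacent pairs that may be reordered


def corrupt(tagged, p=0.3, rng=None):
    """Return a noisy token list built from (token, pos) pairs."""
    rng = rng or random.Random()
    out = []
    i = 0
    while i < len(tagged):
        tok, pos = tagged[i]
        if pos in DROPPABLE and rng.random() < p:
            i += 1  # deletion error: skip this token
            continue
        if (pos in SWAPPABLE and i + 1 < len(tagged)
                and tagged[i + 1][1] in SWAPPABLE and rng.random() < p):
            out.append(tagged[i + 1][0])  # word-order error: swap the pair
            out.append(tok)
            i += 2
            continue
        out.append(tok)
        i += 1
    return out


def make_pair(tagged, p=0.3, rng=None):
    """Build one training pair: (noisy source, clean target)."""
    clean = [tok for tok, _ in tagged]
    return " ".join(corrupt(tagged, p, rng)), " ".join(clean)


# Pre-tagged Romanian sentence ("I go to the sea"); tags supplied by hand here.
sentence = [("Eu", "PRON"), ("merg", "VERB"), ("la", "ADP"), ("mare", "NOUN")]
print(make_pair(sentence, p=1.0, rng=random.Random(0)))
```

Because the rules are keyed on POS tags rather than on language-specific word lists, the same table can be reused for any language with a tagger that emits the same tag set, which is the property the abstract highlights.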