Refining Czech GEC: Insights from a Multi-Experiment Approach

📅 2025-06-27

🤖 AI Summary

To improve the quality and efficiency of Czech grammatical error correction (GEC), this paper presents a system based on neural machine translation with the Transformer architecture. Its key feature is a real-time synthetic error generation pipeline that dynamically augments sentences with both language-agnostic and Czech-specific artificial errors. The authors investigate which Czech GEC corpora to use as bases for error introduction, several error generation strategies, domain balancing, tokenization granularity, model size, and data scaling during fine-tuning. They also evaluate large language models (LLMs) on Czech GEC in both end-user and expert fine-tuning scenarios. The best-performing model is reported to achieve state-of-the-art results for Czech, leading in both correction performance and computational efficiency. The trained models and source code are publicly released.

📝 Abstract

We present a grammar error correction (GEC) system that achieves state of the art for the Czech language. Our system is based on a neural network translation approach with the Transformer architecture, and its key feature is its real-time synthetic generation pipeline, which dynamically augments sentences with artificial errors by introducing both language-agnostic and Czech-specific errors. We conduct a comprehensive series of experiments, investigating the Czech GEC corpora as bases for synthetic error introduction, several error generation strategies, domain balancing, tokenization granularity, model size, and data scaling during fine-tuning. Additionally, we evaluate the performance of large language models (LLMs) on Czech GEC in both end-user and expert fine-tuning scenarios. Our best-performing model is superior both in performance and computational efficiency. The source code and the trained model links are available on https://github.com/ufal/tsd2025-gec.

Problem

Research questions and friction points this paper is trying to address.

Develop a state-of-the-art grammatical error correction system for Czech

Explore synthetic error generation for Czech-specific GEC training

Evaluate LLMs and optimize model efficiency for Czech GEC

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based neural network for Czech GEC

Real-time synthetic error generation pipeline

Comprehensive multi-experiment optimization approach
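
The real-time synthetic error generation idea can be illustrated with a minimal Python sketch. This is not the authors' implementation (see their repository for that). The function names, the specific error operations (adjacent-character swap, character deletion, diacritics stripping, i/y confusion), and the `token_error_rate` parameter are all illustrative assumptions. The sketch only shows the general pattern: clean sentences are corrupted on the fly with a mix of language-agnostic and Czech-specific errors, yielding (noisy, clean) training pairs.

```python
import random
import unicodedata

# Hypothetical mapping for a Czech-specific i/y confusion (assumption, not from the paper).
I_Y_SWAP = str.maketrans("iyíý", "yiýí")


def strip_diacritics(word: str, rng: random.Random) -> str:
    """Czech-specific (assumed): drop diacritics, e.g. 'říká' -> 'rika'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def swap_i_y(word: str, rng: random.Random) -> str:
    """Czech-specific (assumed): exchange i/y and í/ý throughout the word."""
    return word.translate(I_Y_SWAP)


def swap_adjacent_chars(word: str, rng: random.Random) -> str:
    """Language-agnostic: transpose two neighbouring characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def delete_char(word: str, rng: random.Random) -> str:
    """Language-agnostic: remove one character, never emptying the token."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


OPERATIONS = [strip_diacritics, swap_i_y, swap_adjacent_chars, delete_char]


def corrupt(sentence: str, rng: random.Random, token_error_rate: float = 0.15) -> str:
    """Apply one randomly chosen operation to each token with the given probability."""
    noisy = []
    for token in sentence.split():
        if rng.random() < token_error_rate:
            token = rng.choice(OPERATIONS)(token, rng)
        noisy.append(token)
    return " ".join(noisy)


def training_pairs(clean_sentences, seed: int = 0, token_error_rate: float = 0.15):
    """Yield (noisy, clean) pairs lazily, so errors are generated during training."""
    rng = random.Random(seed)
    for sentence in clean_sentences:
        yield corrupt(sentence, rng, token_error_rate), sentence


if __name__ == "__main__":
    for noisy, clean in training_pairs(["Děti byly včera venku."], token_error_rate=0.5):
        print(noisy, "->", clean)
```

Because the pairs come from a generator, each pass over the clean corpus can see different corruptions when the seed varies, which is the practical appeal of generating errors dynamically rather than precomputing a fixed noisy corpus.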
