KoGEC : Korean Grammatical Error Correction with Pre-trained Translation Models

📅 2025-06-13
🏛️ Pacific Asia Conference on Language, Information and Computation
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the lack of lightweight, domain-specialized models for Korean Grammatical Error Correction (GEC), this paper proposes KoGEC, a fine-tuning framework built on the NLLB multilingual translation model that uses language-specific token markers and a domain adaptation strategy tailored to social media text. It also introduces an “LLM-as-judge” evaluation paradigm for fine-grained error type classification and performance attribution analysis. Experiments on two Korean social media benchmarks show that KoGEC outperforms GPT-4o and HCX-3, particularly in correcting punctuation, postpositional particles, and word order errors, while achieving a more balanced correction profile across error types. The system is open-sourced and deployed as a Chrome extension, filling a gap in lightweight, production-ready Korean GEC solutions.

📝 Abstract
This research introduces KoGEC, a Korean Grammatical Error Correction system using pre-trained translation models. We fine-tuned NLLB (No Language Left Behind) models for Korean GEC, comparing their performance against large language models like GPT-4 and HCX-3. The study used two social media conversation datasets for training and testing. The NLLB models were fine-tuned using special language tokens to distinguish between original and corrected Korean sentences. Evaluation was done using BLEU scores and an "LLM as judge" method to classify error types. Results showed that the fine-tuned NLLB (KoGEC) models outperformed GPT-4o and HCX-3 in Korean GEC tasks. KoGEC demonstrated a more balanced error correction profile across various error types, whereas the larger LLMs tended to focus less on punctuation errors. We also developed a Chrome extension to make the KoGEC system accessible to users. Finally, we explored token vocabulary expansion to further improve the model but found that it decreased model performance. This research contributes to the field of NLP by providing an efficient, specialized Korean GEC system and a new evaluation method. It also highlights the potential of compact, task-specific models to compete with larger, general-purpose language models in specialized NLP tasks.
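
The abstract names BLEU as one of the two evaluation signals. As a rough illustration of what that score measures, here is a minimal, unsmoothed corpus-level BLEU in plain Python. It is a sketch under stated assumptions, not the paper's evaluation code: the whitespace tokenization, the single reference per sentence, and the function names `ngrams` and `corpus_bleu` are all choices made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus BLEU with one reference per hypothesis.

    Assumption: sentences are tokenized by whitespace, which is a
    simplification for Korean text.
    """
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Counter intersection clips each match at the reference count.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_precision)


# A correction identical to the reference scores 1.0.
perfect = corpus_bleu(["a b c d e"], ["a b c d e"])
# A correction that drops the last token is only hit by the brevity penalty.
short = corpus_bleu(["a b c d"], ["a b c d e"])
```

In a GEC setting the hypothesis is the model's corrected sentence and the reference is the human correction. Because most tokens of an erroneous sentence are already right, BLEU tends to sit high even for weak systems, which is one reason a second signal such as error-type classification is useful alongside it.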

Problem

Research questions and friction points this paper is trying to address.

Develop a Korean Grammatical Error Correction (GEC) system using pre-trained translation models
Compare performance of fine-tuned NLLB models against large language models
Evaluate effectiveness across diverse error types including punctuation errors

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned NLLB models for Korean GEC
Used special language tokens to distinguish original from corrected sentences
Developed a Chrome extension for user accessibility
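
The special-token idea above can be pictured as treating correction like translation between two "languages": the original Korean and the corrected Korean. The sketch below shows one way such pairs might be tagged before fine-tuning. It is an assumption-laden illustration, not the authors' preprocessing: `kor_Hang` is the existing NLLB code for Korean, while `cor_Hang`, `tag_pair`, `strip_tag`, and the sample sentence are hypothetical and introduced only for this example.

```python
# "kor_Hang" is NLLB's language code for Korean. "cor_Hang" is a hypothetical
# placeholder for the corrected side; the paper's actual token is not given here.
SRC_TAG = "kor_Hang"
TGT_TAG = "cor_Hang"


def tag_pair(original: str, corrected: str) -> dict:
    """Frame one GEC pair as a translation example with language tags."""
    return {
        "source": f"{SRC_TAG} {original.strip()}",
        "target": f"{TGT_TAG} {corrected.strip()}",
    }


def strip_tag(text: str, tag: str = TGT_TAG) -> str:
    """Remove a leading tag from generated text, if one is present."""
    prefix = tag + " "
    return text[len(prefix):] if text.startswith(prefix) else text


# Made-up example: a spacing error and a missing period.
pair = tag_pair("안녕 하세요 ", "안녕하세요.")
```

With a real tokenizer the tags would normally be registered as special tokens so they are never split into subwords; that step depends on the library in use and is left out of this sketch.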
Taeeun Kim
Sionic AI Inc., Seoul, Korea; Emory University, Atlanta, GA, USA
Youngsook Song
Lablup ML Researcher
Korean linguistics, artificial intelligence
Semin Jeong
Sionic AI Inc., Seoul, Korea