🤖 AI Summary
To address the scarcity of high-quality annotated data for Spoken Grammar Error Correction (SGEC), this paper proposes a fully automated audio-text pair augmentation method. It integrates text-to-speech synthesis, controllable grammatical error injection, and modeling of disfluency patterns to generate utterances exhibiting both grammatical errors and authentic spoken-language disfluencies. A multi-dimensional objective evaluation framework—assessing language proficiency consistency, error type coverage, and audio-text alignment—is introduced to filter high-fidelity augmented samples. Experiments on the S&I Corpus demonstrate substantial improvements in both written and spoken GEC model performance (F₀.₅ gains of 3.2–5.8 points), while rigorously preserving second-language learners’ proficiency assessment scores. This work establishes the first reproducible, objectively evaluable, and speech-context-aware data augmentation paradigm for SGEC.
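The multi-dimensional filtering described above can be pictured as a weighted combination of per-sample scores. The sketch below is purely illustrative: the metric names mirror the three dimensions mentioned in the summary, but the weights, threshold, and combination rule are assumptions, not details from the paper.

```python
from dataclasses import dataclass

@dataclass
class SampleScores:
    """Per-sample quality scores, each assumed to be normalized to [0, 1]."""
    proficiency_consistency: float  # does the sample match the learner's level?
    error_type_coverage: float      # does it contribute useful error types?
    audio_text_alignment: float     # do the synthesized audio and text agree?

def keep_sample(s: SampleScores,
                weights=(0.4, 0.3, 0.3),   # hypothetical weights
                threshold=0.7) -> bool:    # hypothetical acceptance threshold
    """Accept an augmented sample if its weighted combined score clears the bar."""
    combined = (weights[0] * s.proficiency_consistency
                + weights[1] * s.error_type_coverage
                + weights[2] * s.audio_text_alignment)
    return combined >= threshold
```

In practice the paper filters for "high-fidelity" samples; a linear weighted sum is just one simple way such a filter could be realized.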
📝 Abstract
While strong benchmark datasets exist for grammatical error correction (GEC), Spoken GEC (SGEC) remains under-resourced, with few high-quality annotated spoken datasets. In this paper, we propose a fully automated method to generate audio-text pairs containing both grammatical errors and disfluencies. We also propose a set of objective metrics for evaluating the generated data and selecting the dataset best suited to SGEC. The goal is to produce an augmented dataset that preserves the textual and acoustic characteristics of the original data while introducing new error types. This augmented data should enrich the original corpus without altering the language assessment scores of the second language (L2) learners. We evaluate the augmented corpus both for written GEC (the text part) and for SGEC (the audio-text pairs). Our experiments are conducted on the S&I Corpus, the first publicly available speech dataset with grammatical error annotations.
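The generation side of the method chains error injection, disfluency insertion, and a quality check on the result. The following is a minimal toy sketch of that flow, assuming hypothetical error rules, a token-overlap proxy for audio-text alignment, and no actual TTS step (which the real pipeline would include):

```python
import random

# Hypothetical error-injection rules; the paper's actual error taxonomy
# and injection mechanism are not reproduced here.
ERROR_RULES = {
    "article_drop": lambda toks: [t for t in toks if t.lower() not in ("a", "an", "the")] or toks,
}

DISFLUENCIES = ["uh", "um", "you know"]  # illustrative filler set

def add_disfluency(tokens, rng):
    """Insert one filler at a random position, simulating spoken disfluency."""
    out = list(tokens)
    out.insert(rng.randrange(len(out) + 1), rng.choice(DISFLUENCIES))
    return out

def alignment_score(src_tokens, aug_tokens):
    """Crude stand-in for audio-text alignment: token-set overlap ratio."""
    src, aug = set(src_tokens), set(aug_tokens)
    return len(src & aug) / max(len(src | aug), 1)

def augment(sentence, rule="article_drop", threshold=0.5, seed=0):
    """Inject an error, add a disfluency, and keep the sample only if it
    passes the (toy) alignment filter."""
    rng = random.Random(seed)
    tokens = sentence.split()
    errored = ERROR_RULES[rule](tokens)
    disfluent = add_disfluency(errored, rng)
    score = alignment_score(tokens, disfluent)
    return (" ".join(disfluent), score) if score >= threshold else (None, score)
```

A real implementation would synthesize audio for each augmented transcript and score alignment against that audio; the overlap ratio here only stands in for that check.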