🤖 AI Summary
This study addresses the challenge of first-language (L1) interference in English writing by native Russian speakers, covering issues such as lexical transliteration and tense misuse. To this end, the authors construct RILEC, a large-scale dataset, and propose the first hybrid framework that integrates rule-based and neural approaches for generating and detecting L1-induced errors. The framework combines expert annotations with synthetic data produced by generative language models, rule-based patterns, and prompt engineering; Proximal Policy Optimization (PPO)-based reinforcement learning further enhances the diversity and realism of the generated errors. Experimental results demonstrate that the proposed method significantly improves word-level error detection across interference types, validating the data augmentation strategy.
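To make the rule-based side of the pipeline concrete, here is a minimal sketch of how L1-interference errors like the paper's *stadion*/*stadium* example could be injected into clean sentences to produce word-labeled synthetic training data. The substitution table, label scheme (`C` = correct, `E` = interference error), and injection probability are illustrative assumptions, not the paper's actual rules.

```python
import random

# Hypothetical transliteration rules reflecting Russian L1 influence
# (e.g. "стадион" -> "stadion"); illustrative only, not from RILEC.
TRANSLITERATION_RULES = {
    "stadium": "stadion",
    "program": "programma",
    "group": "gruppa",
}

def inject_errors(tokens, rules=TRANSLITERATION_RULES, p=0.8, seed=None):
    """Replace matching tokens with L1-influenced forms and emit
    word-level labels: 'C' = correct, 'E' = interference error."""
    rng = random.Random(seed)
    out_tokens, labels = [], []
    for tok in tokens:
        if tok.lower() in rules and rng.random() < p:
            out_tokens.append(rules[tok.lower()])
            labels.append("E")
        else:
            out_tokens.append(tok)
            labels.append("C")
    return out_tokens, labels

tokens, tags = inject_errors("The stadium hosts our group".split(), p=1.0)
# tokens == ['The', 'stadion', 'hosts', 'our', 'gruppa']
# tags   == ['C', 'E', 'C', 'C', 'E']
```

In the full framework such rule-generated examples would be mixed with neurally generated ones before fine-tuning the detector.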
📝 Abstract
Many errors in student essays can be explained by influence from the native language (L1). L1 interference refers to errors shaped by a speaker's first language, such as using stadion instead of stadium, a lexical transliteration from Russian. In this work, we address the task of detecting such errors in English essays written by Russian-speaking learners. We introduce RILEC, a large-scale dataset of over 18,000 sentences, combining expert-annotated data from REALEC with synthetic examples generated through rule-based and neural augmentation. We propose a framework for generating L1-motivated errors using generative language models optimized with PPO, prompt-based control, and rule-based patterns. Models fine-tuned on RILEC achieve strong performance, particularly on word-level interference types such as transliteration and tense semantics. We find that the proposed augmentation pipeline yields a significant performance improvement, making the resulting models a potentially valuable tool for learners and teachers to more effectively identify and address such errors.
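The "strong performance on word-level interference types" reported above is naturally measured with token-level precision, recall, and F1 over aligned gold and predicted label sequences. A minimal sketch follows; the binary label scheme (`E` = interference error) is an assumption for illustration, not the paper's exact evaluation protocol.

```python
def word_level_f1(gold, pred, error_label="E"):
    """Token-level precision/recall/F1 for the error class, computed
    over aligned gold and predicted word-label sequences."""
    tp = sum(g == error_label and p == error_label for g, p in zip(gold, pred))
    fp = sum(g != error_label and p == error_label for g, p in zip(gold, pred))
    fn = sum(g == error_label and p != error_label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = ["C", "E", "C", "E"]
pred = ["C", "E", "E", "C"]
# word_level_f1(gold, pred) == (0.5, 0.5, 0.5)
```

In practice a per-type breakdown (transliteration, tense semantics, etc.) would be computed the same way, restricted to each interference category.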