🤖 AI Summary
This study investigates how editorial intervention affects native language identification in human–AI collaborative writing. Drawing on 450 essays from the Write & Improve 2024 corpus, the authors simulate increasing degrees of editing through grammatical error correction and paraphrasing, then evaluate how robust native language identification models remain on the edited texts. The work provides empirical evidence that native-language traces stem not only from surface-level errors but also from deeper features such as unidiomatic lexico-semantic choices, pragmatic transfer, and culturally grounded perspectives. Minimal edits preserve these structural L1 traces and sustain high identification accuracy, whereas fluency edits and paraphrasing normalize them, leading to a severe decline in model performance.
📝 Abstract
Native Language Identification (NLI) is the task of determining an author's native language (L1) from their non-native writing. With the advent of human-AI co-authorship, non-native texts are routinely corrected and rewritten by large language models, fundamentally altering the linguistic features that NLI models depend on. In this paper, we investigate the robustness of L1 traces across increasing degrees of editorial intervention. By processing 450 essays from the Write & Improve 2024 corpus through varying levels of grammatical error correction (GEC) and paraphrasing, we demonstrate that L1 attribution does not depend entirely on surface-level errors. Instead, the detection models leverage deeper L1 features: unidiomatic lexico-semantic choices, pragmatic transfer, and the author's underlying cultural perspective. We find that minimal edits preserve these structural traces and maintain high profiling accuracy. In contrast, fluency edits and paraphrasing normalize these L1 features, leading to a severe degradation in performance.
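The comparison described in the abstract, NLI accuracy measured separately at each degree of editing, can be illustrated with a minimal sketch. This is not the authors' code: the function name `accuracy_by_edit_level`, the edit-level labels, and the toy records are all hypothetical, and the predictions are invented only to exercise the function. They are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_edit_level(records):
    """Compute L1 identification accuracy per editing level.

    `records` is an iterable of (edit_level, true_l1, predicted_l1) tuples.
    The tuple layout is an assumption made for this illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for level, true_l1, predicted_l1 in records:
        total[level] += 1
        correct[level] += int(true_l1 == predicted_l1)
    return {level: correct[level] / total[level] for level in total}


# Toy, made-up predictions (hypothetical labels and L1 codes), not paper data.
toy_records = [
    ("original", "es", "es"),
    ("original", "ja", "ja"),
    ("minimal_gec", "es", "es"),
    ("minimal_gec", "ja", "ja"),
    ("paraphrase", "es", "fr"),
    ("paraphrase", "ja", "ja"),
]

scores = accuracy_by_edit_level(toy_records)
for level, accuracy in scores.items():
    print(f"{level}: {accuracy:.2f}")
```

Grouping by edit level keeps the same essays and the same classifier fixed across conditions, so any difference in accuracy can be attributed to the editing step rather than to the data.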