🤖 AI Summary
This study investigates how native language background influences the difficulty of learning English vocabulary, focusing on speakers of Spanish, German, and Chinese. It trains gradient-boosted models on features covering word familiarity (e.g., frequency), meaning, surface form, and cross-linguistic transfer, and uses Shapley values to measure the importance of each feature group. Word familiarity is the dominant feature group for all three languages. Predictions for Spanish- and German-speaking learners additionally rely on orthographic transfer, whereas this mechanism is unavailable to Chinese-speaking learners, whose difficulty is shaped by familiarity and surface features alone. The resulting models yield interpretable, native-language-tailored difficulty estimates that can inform vocabulary curriculum design.
📝 Abstract
What makes a word difficult to learn, and how does the difficulty depend on the learner's native language? We computationally model vocabulary difficulty for English learners whose first language is Spanish, German, or Chinese with gradient-boosted models trained on features related to a word's familiarity (e.g., frequency), meaning, surface form, and cross-linguistic transfer. Using Shapley values, we determine the importance of each feature group. Word familiarity is the dominant feature group for all three languages. However, predictions for Spanish- and German-speaking learners rely additionally on orthographic transfer. This transfer mechanism is unavailable to Chinese-speaking learners, whose difficulty is shaped by a combination of familiarity and surface features alone. Our models provide interpretable, L1-tailored difficulty estimates that can be used to design vocabulary curricula.
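To make the idea of Shapley-based importance per feature group concrete, the following is a minimal sketch and not the authors' implementation. It computes exact Shapley values over four feature groups for a single word, using a hand-written stand-in for a trained model. The function `toy_difficulty`, its coefficients, the example scores, and the all-zero baseline are all hypothetical. In practice a trained gradient-boosted model would take the place of `toy_difficulty`.

```python
from itertools import combinations
from math import factorial

# Feature groups named in the abstract.
GROUPS = ["familiarity", "meaning", "surface", "transfer"]


def toy_difficulty(x):
    """Hypothetical stand-in for a trained model (coefficients are made up).

    Maps one score per feature group to a difficulty estimate and includes
    one interaction term between surface form and transfer.
    """
    return (0.8
            - 0.5 * x["familiarity"]
            - 0.1 * x["meaning"]
            + 0.2 * x["surface"]
            - 0.3 * x["transfer"] * x["surface"])


def group_shapley(predict, x, baseline, groups=GROUPS):
    """Exact Shapley value of each feature group for one prediction.

    Groups outside a coalition are replaced by their baseline value.
    """
    n = len(groups)

    def value(coalition):
        mixed = {g: (x[g] if g in coalition else baseline[g]) for g in groups}
        return predict(mixed)

    phi = {}
    for g in groups:
        others = [h for h in groups if h != g]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude g.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {g}) - value(s))
        phi[g] = total
    return phi


# Hypothetical group scores for one word and an all-zero baseline.
word = {"familiarity": 0.9, "meaning": 0.5, "surface": 0.4, "transfer": 0.7}
baseline = {g: 0.0 for g in GROUPS}
phi = group_shapley(toy_difficulty, word, baseline)

# Same word with the transfer score held at its baseline value.
word_no_transfer = dict(word, transfer=0.0)
phi_no_transfer = group_shapley(toy_difficulty, word_no_transfer, baseline)
```

The attributions sum to the difference between the prediction for the word and the prediction for the baseline. A group whose value equals its baseline receives zero attribution, which is one simple way to picture a transfer signal that carries no information. Enumerating every coalition is feasible here only because there are four groups; with many individual features, an approximate or tree-specific method would be needed.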