🤖 AI Summary
This study addresses the challenge of automatically selecting highly informative contextual examples for first-language vocabulary instruction for high school students. The authors propose a hybrid approach that integrates deep learning with handcrafted features to predict contextual informativeness. Central to their method is a novel metric, the Retention Competency Curve, which visualizes the trade-off between the proportion of good contexts discarded and the ratio of good to bad contexts retained. Their strongest model combines instruction-aware embeddings derived from fine-tuned Qwen3 with manually engineered contextual features, all modeled through a nonlinear regression head. Experimental results demonstrate that this model substantially improves corpus quality: while discarding 70% of the good contexts, it elevates the ratio of good to bad contexts to 440:1, substantially enhancing the efficacy of instructional materials.
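The hybrid architecture described above (instruction-aware embeddings concatenated with handcrafted features, fed to a nonlinear regression head) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the data is synthetic, the 768-dimensional embeddings stand in for the fine-tuned Qwen3 vectors, and the five handcrafted features are hypothetical placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic stand-ins for the paper's inputs (shapes are assumptions):
# 768-d instruction-aware sentence embeddings plus a handful of
# handcrafted context features (e.g. context length, target-word position).
rng = np.random.default_rng(0)
n = 200
embeddings = rng.normal(size=(n, 768))    # stand-in for Qwen3 embeddings
handcrafted = rng.normal(size=(n, 5))     # hypothetical handcrafted features
X = np.hstack([embeddings, handcrafted])  # concatenate both feature groups
y = rng.uniform(size=n)                   # synthetic informativeness ratings

# Nonlinear regression head: a small MLP over the combined feature vector.
head = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
head.fit(X, y)
scores = head.predict(X)                  # predicted informativeness per context
```

Contexts can then be ranked by `scores` and filtered with a threshold, which is exactly the operating point the Retention Competency Curve sweeps over.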
📝 Abstract
We describe a modern deep learning system that automatically identifies informative contextual examples (\qu{contexts}) for first-language vocabulary instruction for high school students. Our paper compares three modeling approaches: (i) an unsupervised similarity-based strategy using MPNet's uniformly contextualized embeddings, (ii) a supervised framework built on instruction-aware, fine-tuned Qwen3 embeddings with a nonlinear regression head, and (iii) model (ii) augmented with handcrafted context features. We introduce a novel metric, the Retention Competency Curve, to visualize the trade-off between the discarded proportion of good contexts and the \qu{good-to-bad} context ratio, providing a compact, unified lens on model performance. Model (iii) delivers the most dramatic gains, achieving a good-to-bad ratio of 440 while discarding only 70\% of the good contexts. In summary, we demonstrate that a modern embedding model combined with a neural-network regression head, when guided by human supervision, yields a low-cost, large supply of near-perfect contexts for teaching a variety of target vocabulary words.
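The trade-off the Retention Competency Curve captures (proportion of good contexts discarded versus the good-to-bad ratio among retained contexts, as a score threshold sweeps) can be sketched as below. The function name, inputs, and threshold-sweep details are assumptions for illustration, not the authors' definition.

```python
import numpy as np

def retention_competency_curve(scores, labels):
    """Sketch of a retention-competency-style curve (hypothetical form).

    For each score threshold, contexts scoring at or above the threshold
    are retained. We record (a) the proportion of good contexts discarded
    and (b) the good-to-bad ratio among the retained contexts.
    scores: predicted informativeness per context; labels: 1 = good, 0 = bad.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    total_good = labels.sum()
    points = []
    for t in np.unique(scores):           # sweep thresholds in ascending order
        keep = scores >= t
        good_kept = int(labels[keep].sum())
        bad_kept = int((1 - labels[keep]).sum())
        discarded_good = 1.0 - good_kept / total_good
        ratio = good_kept / bad_kept if bad_kept else float("inf")
        points.append((discarded_good, ratio))
    return points
```

Under this reading, an operating point such as "good-to-bad ratio of 440 at 70% of good contexts discarded" is one point on the curve; plotting all points shows how aggressively filtering must discard good contexts to purify the retained set.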