🤖 AI Summary
This work addresses key challenges in identifying students’ misconceptions (data scarcity, high annotation noise, the pretraining biases and deployment difficulties of large models, and the overfitting tendencies of small models) by proposing a two-stage knowledge distillation framework. In the first stage, task-specific capabilities are transferred from a teacher model to a compact student model. The second stage introduces a dual-level margin-based sample selection mechanism grounded in cognitive uncertainty to identify four categories of critical samples, coupled with a difficulty-adaptive strategy that dynamically blends hard and soft labels to enhance discrimination of ambiguous error types. With augmented training on only 10.30% of the samples, selected as high-value, the method achieves a MAP@3 of 0.9585 (+17.8%) on the MAP-Charting dataset. With only a 4B-parameter model, it reaches 84.38% accuracy on cross-topic tests of middle school algebra misconception benchmarks, substantially outperforming state-of-the-art large language models (67.73%) and even a fine-tuned 72B-parameter model (81.25%).
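For readers unfamiliar with the metric, the sketch below shows how MAP@3 is commonly computed when each sample has exactly one correct label: a prediction scores 1, 1/2, or 1/3 if the true label is ranked first, second, or third, and 0 otherwise. This single-label definition is an assumption, the function name `map_at_3` is hypothetical, and the labels in the usage note are invented for illustration. It is not taken from the authors' code.

```python
def map_at_3(true_labels, ranked_predictions):
    """Mean average precision at 3, assuming one correct label per sample.

    true_labels: list of ground-truth labels.
    ranked_predictions: list of label lists, best guess first.
    """
    total = 0.0
    for truth, ranked in zip(true_labels, ranked_predictions):
        # Only the top three guesses can earn credit.
        for rank, guess in enumerate(ranked[:3], start=1):
            if guess == truth:
                total += 1.0 / rank  # 1, 1/2, or 1/3 depending on rank
                break
    return total / len(true_labels)
```

For example, with true labels `["a", "b", "c"]` and ranked predictions `[["a", "x", "y"], ["x", "b", "y"], ["x", "y", "z"]]`, the per-sample scores are 1, 1/2, and 0, so the function returns 0.5.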
📝 Abstract
Accurately identifying student misconceptions is crucial for personalized education but faces three challenges: (1) data scarcity with a long-tail distribution, where authentic student reasoning is difficult to synthesize; (2) fuzzy boundaries between error categories, with high annotation noise; and (3) a deployment paradox: large models overlook unconventional approaches due to pretraining bias and cannot be deployed on edge devices, while small models overfit to noise. Unlike traditional methods that increase diversity through large-scale data synthesis, we propose a two-stage knowledge distillation framework that mines high-value samples from existing data. The first stage performs standard distillation to transfer task capabilities. The second stage introduces a dual-layer marginal selection mechanism based on cognitive uncertainty, identifying four types of critical samples based on teacher model uncertainty and confidence differences. For different data subsets, we design a difficulty-adaptive mechanism to balance hard- and soft-label contributions, enabling student models to inherit inter-class relationships from teacher soft labels while distinguishing ambiguous error types. Experiments show that with augmented training on only 10.30% of filtered samples, we achieve a MAP@3 of 0.9585 (+17.8%) on the MAP-Charting dataset, and using only a 4B-parameter model, we attain 84.38% accuracy on cross-topic tests of middle school algebra misconception benchmarks, significantly outperforming state-of-the-art LLMs (67.73%) and standard fine-tuned 72B models (81.25%). Our code is available at https://github.com/RoschildRui/acl2026_map.
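To make the idea of balancing hard and soft labels concrete, here is a minimal sketch of a per-sample distillation loss whose blend weight depends on how ambiguous the teacher finds the sample. Everything specific in it is an assumption made for illustration: the abstract does not give the weighting rule, so the teacher's top-2 probability margin is used as a hypothetical difficulty signal, and the names `top2_margin` and `blended_loss` and the temperature value are invented. It is not the authors' implementation.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top2_margin(probs):
    """Gap between the two most likely classes; a small gap means ambiguity."""
    first, second = sorted(probs, reverse=True)[:2]
    return first - second

def blended_loss(student_logits, teacher_logits, label, temperature=2.0):
    """Blend hard-label cross-entropy with soft-label KL divergence.

    ASSUMPTION: the hard-label weight alpha is the teacher's top-2 margin,
    so ambiguous samples (small margin) lean on the teacher's soft labels.
    The paper's actual difficulty-adaptive rule may differ.
    """
    alpha = top2_margin(softmax(teacher_logits))
    # Hard-label term: cross-entropy against the annotated class.
    ce = -math.log(softmax(student_logits)[label])
    # Soft-label term: KL(teacher || student) at the given temperature.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(teacher_soft, student_soft) if t > 0)
    # The T^2 factor is the usual rescaling in temperature-based distillation.
    return alpha * ce + (1.0 - alpha) * (temperature ** 2) * kl
```

Under this assumed rule, a sample on which the teacher is confident is trained mostly against its annotated label, while a sample near a class boundary is trained mostly against the teacher's full distribution, which carries the inter-class relationships the abstract refers to.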