🤖 AI Summary
This study addresses the evaluation of AI tutors' pedagogical competence in educational dialogues, focusing on two tasks: mistake identification (Track 1) and mistake location (Track 2), both formulated as three-class classification problems. To cope with the data scarcity and non-i.i.d. dialogue samples common in educational NLP, we propose a hard-voting ensemble of MPNet classifiers trained with grouped cross-validation, so that no dialogue is split across training and validation. We further introduce a comprehensive evaluation framework integrating error typology analysis, t-SNE-based interpretability visualization, and fine-grained confusion matrices. The system combines a pretrained MPNet backbone, a class-weighted loss, and 10-fold grouped cross-validation. On the official test set it achieves macro-F1 scores of 0.7110 for mistake identification and 0.5543 for mistake location, substantially outperforming the baselines. These results validate the effectiveness and interpretability of our ensemble strategy and evaluation paradigm in low-resource educational settings.
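To make the grouping step concrete, the sketch below builds folds so that all turns from one dialogue land on the same side of every split. This is a minimal sketch, assuming scikit-learn's `GroupKFold` and a hypothetical `dialogue_id` field; it is illustrative, not the submission's actual data-loading code.

```python
# Minimal sketch of dialogue-grouped 10-fold CV (assumes scikit-learn).
# The `dialogue_id` field name is hypothetical, not the shared task's
# actual schema.
from sklearn.model_selection import GroupKFold

def grouped_folds(examples, n_splits=10):
    """Yield (train, val) splits in which no dialogue ever appears in
    both sides, preventing dialogue-level leakage between splits."""
    groups = [ex["dialogue_id"] for ex in examples]
    gkf = GroupKFold(n_splits=n_splits)
    for train_idx, val_idx in gkf.split(examples, groups=groups):
        yield ([examples[i] for i in train_idx],
               [examples[i] for i in val_idx])
```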
📝 Abstract
We present Team BD's submission to the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors, under Track 1 (Mistake Identification) and Track 2 (Mistake Location). Both tracks involve three-class classification of tutor responses in educational dialogues: determining whether a tutor correctly recognizes a student's mistake (Track 1) and whether the tutor pinpoints the mistake's location (Track 2). Our system is built on MPNet, a Transformer-based language model that combines the pre-training advantages of BERT and XLNet. We fine-tuned MPNet on the task data with a class-weighted cross-entropy loss to handle class imbalance, and used grouped 10-fold cross-validation to maximize the use of the limited data while avoiding dialogue overlap between training and validation splits. We then combined the best model from each fold in a hard-voting ensemble, which improves robustness and generalization over any single classifier. Our approach achieved strong results on both tracks, with exact-match macro-F1 scores of 0.7110 for Mistake Identification and 0.5543 for Mistake Location on the official test set. We include a comprehensive analysis of our system's performance, with confusion matrices and t-SNE visualizations to interpret classifier behavior, as well as a taxonomy of common errors with examples. We hope our ensemble-based approach and findings provide useful insights for designing reliable tutor-response evaluation systems in educational dialogue settings.
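The two remaining ingredients, the class-weighted loss and the hard vote over per-fold models, can be sketched as follows. This is a minimal sketch assuming PyTorch and NumPy; the inverse-frequency weighting scheme and the argmax tie-breaking rule are illustrative choices on our part, not necessarily the submission's exact settings.

```python
import numpy as np
import torch

def class_weighted_loss(train_labels, num_labels=3):
    """Cross-entropy with inverse-frequency class weights, one common
    way to counter the label imbalance described above."""
    counts = np.bincount(train_labels, minlength=num_labels)
    counts = np.maximum(counts, 1)  # guard against empty classes
    weights = counts.sum() / (num_labels * counts)
    return torch.nn.CrossEntropyLoss(
        weight=torch.tensor(weights, dtype=torch.float))

def hard_vote(per_fold_preds, num_labels=3):
    """Majority (plurality) vote over the per-fold models' labels.
    per_fold_preds: list of equal-length int arrays, one per fold."""
    votes = np.stack(per_fold_preds)             # (n_folds, n_examples)
    counts = np.apply_along_axis(
        lambda v: np.bincount(v, minlength=num_labels), 0, votes)
    return counts.argmax(axis=0)                 # ties -> lowest label id
```

With three classes and ten voters, a strict majority is not guaranteed, so the argmax implements a plurality vote; each fold's best checkpoint casts one discrete label per test example.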