🤖 AI Summary
This work addresses the automatic evaluation of AI tutoring agents' ability to identify errors in students' mathematical reasoning, as posed in Track 1 of the BEA 2025 Shared Task. The authors explore four approaches: ensembles of machine learning models over pooled token embeddings from multiple pretrained language models, a frozen sentence-transformer with an MLP classifier, a history-aware model applying multi-head attention between token-level history and response embeddings, and retrieval-augmented few-shot prompting with GPT-4o. The final system retrieves semantically similar examples, constructs structured prompts, and applies schema-guided output parsing to yield interpretable predictions. On the Track 1 benchmark it outperforms all baselines, demonstrating the value of combining example-driven prompting with LLM reasoning for pedagogical feedback assessment. The implementation is publicly available.
📝 Abstract
This paper presents our system for Track 1: Mistake Identification in the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The task involves evaluating whether a tutor's response correctly identifies a mistake in a student's mathematical reasoning. We explore four approaches: (1) an ensemble of machine learning models over pooled token embeddings from multiple pretrained language models (LMs); (2) a frozen sentence-transformer using [CLS] embeddings with an MLP classifier; (3) a history-aware model with multi-head attention between token-level history and response embeddings; and (4) a retrieval-augmented few-shot prompting system with a large language model (LLM), namely GPT-4o. Our final system retrieves semantically similar examples, constructs structured prompts, and uses schema-guided output parsing to produce interpretable predictions. It outperforms all baselines, demonstrating the effectiveness of combining example-driven prompting with LLM reasoning for pedagogical feedback assessment. Our code is available at https://github.com/NaumanNaeem/BEA_2025.
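The final pipeline (retrieve similar examples → build a structured few-shot prompt → parse the LLM's output against a schema) can be sketched as below. This is a minimal illustration, not the authors' implementation: the example bank, the bag-of-words embedding (a stand-in for the Sentence-BERT retriever), and the JSON output schema with the label set {"Yes", "To some extent", "No"} are all assumptions for the sketch; the real system would embed with a sentence-transformer and send the prompt to GPT-4o.

```python
import json
from collections import Counter

# Hypothetical example bank; the real system retrieves from the BEA 2025
# training split. Each entry: (dialogue context, tutor response, gold label).
EXAMPLE_BANK = [
    ("Student: 3 + 4 = 12", "Check your addition: 3 + 4 is 7.", "Yes"),
    ("Student: 5 * 6 = 30", "Great, that's correct!", "No"),
    ("Student: 9 - 2 = 6", "You're close, recount that subtraction.", "To some extent"),
]

VALID_LABELS = {"Yes", "To some extent", "No"}  # assumed label set for the sketch

def embed(text: str) -> Counter:
    """Bag-of-words vector: a toy stand-in for a Sentence-BERT embedding."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 2):
    """Return the k examples most similar to the query text."""
    q = embed(query)
    ranked = sorted(
        EXAMPLE_BANK,
        key=lambda ex: cosine(q, embed(ex[0] + " " + ex[1])),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(history: str, response: str) -> str:
    """Assemble a structured few-shot prompt that states the output schema."""
    shots = "\n\n".join(
        f'Context: {h}\nTutor: {r}\nAnswer: {{"mistake_identification": "{y}"}}'
        for h, r, y in retrieve(history + " " + response)
    )
    return (
        "Decide whether the tutor's response identifies the student's mistake.\n"
        'Reply with JSON: {"mistake_identification": "Yes" | "To some extent" | "No"}\n\n'
        f"{shots}\n\nContext: {history}\nTutor: {response}\nAnswer:"
    )

def parse_output(raw: str) -> str:
    """Schema-guided parsing: reject any LLM output outside the label set."""
    label = json.loads(raw).get("mistake_identification")
    if label not in VALID_LABELS:
        raise ValueError(f"label outside schema: {label!r}")
    return label
```

In use, `build_prompt` produces the text sent to the LLM, and `parse_output` validates the raw completion, which is what makes the predictions interpretable and machine-checkable rather than free-form.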