🤖 AI Summary
This work addresses the automatic evaluation of AI tutoring agents' ability to identify errors in students' mathematical reasoning, as posed in Track 1 of the BEA 2025 Shared Task. The authors explore four approaches: ensembles of machine learning models over pooled token embeddings from multiple pretrained language models, a frozen sentence-transformer with an MLP classifier, a history-aware model applying multi-head attention between token-level history and response embeddings, and retrieval-augmented few-shot prompting with GPT-4o. The final system retrieves semantically similar examples, constructs structured prompts, and applies schema-guided output parsing to yield interpretable predictions. On the Track 1 benchmark it outperforms all baselines, demonstrating the value of combining example-driven prompting with LLM reasoning for pedagogical feedback assessment. The implementation is publicly available.
📝 Abstract
This paper presents our system for Track 1: Mistake Identification in the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The task involves evaluating whether a tutor's response correctly identifies a mistake in a student's mathematical reasoning. We explore four approaches: (1) an ensemble of machine learning models over pooled token embeddings from multiple pretrained language models (LMs); (2) a frozen sentence-transformer using [CLS] embeddings with an MLP classifier; (3) a history-aware model with multi-head attention between token-level history and response embeddings; and (4) a retrieval-augmented few-shot prompting system with a large language model (LLM), namely GPT-4o. Our final system retrieves semantically similar examples, constructs structured prompts, and uses schema-guided output parsing to produce interpretable predictions. It outperforms all baselines, demonstrating the effectiveness of combining example-driven prompting with LLM reasoning for pedagogical feedback assessment. Our code is available at https://github.com/NaumanNaeem/BEA_2025.
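The final pipeline (retrieve similar examples → build a structured few-shot prompt → parse the LLM's output against a schema) can be sketched as below. This is a minimal illustration, not the authors' implementation: the example bank, the bag-of-words embedding (a stand-in for the Sentence-BERT retriever), and the JSON output schema with the label set {"Yes", "To some extent", "No"} are all assumptions for the sketch; the real system would embed with a sentence-transformer and send the prompt to GPT-4o.

```python
import json
from collections import Counter

# Hypothetical example bank; the real system retrieves from the BEA 2025
# training split. Each entry: (dialogue context, tutor response, gold label).
EXAMPLE_BANK = [
    ("Student: 3 + 4 = 12", "Check your addition: 3 + 4 is 7.", "Yes"),
    ("Student: 5 * 6 = 30", "Great, that's correct!", "No"),
    ("Student: 9 - 2 = 6", "You're close, recount that subtraction.", "To some extent"),
]

VALID_LABELS = {"Yes", "To some extent", "No"}  # assumed label set for the sketch

def embed(text: str) -> Counter:
    """Bag-of-words vector: a toy stand-in for a Sentence-BERT embedding."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 2):
    """Return the k examples most similar to the query text."""
    q = embed(query)
    ranked = sorted(
        EXAMPLE_BANK,
        key=lambda ex: cosine(q, embed(ex[0] + " " + ex[1])),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(history: str, response: str) -> str:
    """Assemble a structured few-shot prompt that states the output schema."""
    shots = "\n\n".join(
        f'Context: {h}\nTutor: {r}\nAnswer: {{"mistake_identification": "{y}"}}'
        for h, r, y in retrieve(history + " " + response)
    )
    return (
        "Decide whether the tutor's response identifies the student's mistake.\n"
        'Reply with JSON: {"mistake_identification": "Yes" | "To some extent" | "No"}\n\n'
        f"{shots}\n\nContext: {history}\nTutor: {response}\nAnswer:"
    )

def parse_output(raw: str) -> str:
    """Schema-guided parsing: reject any LLM output outside the label set."""
    label = json.loads(raw).get("mistake_identification")
    if label not in VALID_LABELS:
        raise ValueError(f"label outside schema: {label!r}")
    return label
```

In use, `build_prompt` produces the text sent to the LLM, and `parse_output` validates the raw completion, which is what makes the predictions interpretable and machine-checkable rather than free-form.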