Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors

📅 2025-07-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper reports on a shared task addressing the lack of systematic evaluation of mistake-remediation capabilities in LLM-powered AI tutors within educational dialogues. The task comprises five tracks: mistake identification, precise mistake location, providing guidance, feedback actionability, and tutor identity detection. The four pedagogical dimensions are grounded in learning science principles that characterize good and effective tutor responses, and each is framed as a multi-class classification problem scored with macro-F1 as the unified evaluation metric. The organizers publicly release a large-scale, expert-annotated dataset and benchmark to support standardized evaluation of AI tutors. Over 50 international teams participated; the best systems achieved macro-F1 scores of 58.34–71.81 across the four three-class pedagogical tracks and 96.98 on the 9-class tutor identification track, highlighting substantial room for improvement in mistake remediation.
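Since each pedagogical track is framed as multi-class classification scored with macro-F1, the short sketch below shows how such a score is computed with scikit-learn. The three-class label set mirrors the setup described above, but the example gold labels and predictions are hypothetical, not drawn from the released benchmark.

```python
# Minimal sketch of macro-F1 scoring for a three-class track such as
# mistake identification. The gold/pred values below are hypothetical
# illustrations, not data from the shared task's benchmark.
from sklearn.metrics import f1_score

gold = ["Yes", "No", "To some extent", "Yes", "No", "Yes"]
pred = ["Yes", "No", "No", "Yes", "To some extent", "Yes"]

# Macro-F1 computes per-class F1 and averages the classes with equal
# weight, so rare classes influence the score as much as frequent ones.
score = f1_score(gold, pred, average="macro")
print(f"macro F1: {score:.2f}")
```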

📝 Abstract
This shared task aimed to assess the pedagogical abilities of AI tutors powered by large language models (LLMs), focusing on evaluating the quality of tutor responses aimed at remediating students' mistakes within educational dialogues. The task consisted of five tracks designed to automatically evaluate the AI tutor's performance across the key dimensions of mistake identification, precise location of the mistake, providing guidance, and feedback actionability, grounded in learning science principles that define good and effective tutor responses, as well as a fifth track focusing on detection of the tutor's identity. The task attracted over 50 international teams across all tracks. The submitted models were evaluated against gold-standard human annotations, and the results, while promising, show that there is still significant room for improvement in this domain: the best results for the four pedagogical ability assessment tracks range between macro F1 scores of 58.34 (for providing guidance) and 71.81 (for mistake identification) on three-class problems, with the best F1 score in the tutor identification track reaching 96.98 on a 9-class task. In this paper, we overview the main findings of the shared task, discuss the approaches taken by the teams, and analyze their performance. All resources associated with this task are made publicly available to support future research in this critical domain.
Problem

Research questions and friction points this paper is trying to address.

Assess AI tutors' ability to remediate student mistakes
Evaluate AI tutors' performance across key dimensions of educational dialogue
Improve pedagogical abilities of LLM-powered AI tutors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Assessed AI tutors powered by large language models (LLMs)
Evaluated pedagogical abilities across five tracks
Compared submitted models against gold-standard human annotations