🤖 AI Summary
This study investigates whether mainstream large language models (LLMs) can effectively serve as AI tutors, specifically in mathematics education contexts involving error-based student–tutor dialogues.
Method: We propose the first pedagogically grounded evaluation framework for teaching competence, encompassing eight theoretically informed dimensions rooted in the learning sciences. We introduce MRBench, the first multidimensional, human-annotated benchmark for educational dialogue (192 multi-turn dialogues, 1,596 responses), which enables assessment that is comparable across dimensions and between human and LLM tutors. Our methodology integrates educational dialogue analysis, fine-grained human annotation, LLM-based meta-evaluation (using Prometheus2 and Llama-3.1-8B), and statistical reliability testing.
Contribution/Results: Empirical results reveal substantial divergence in LLMs' pedagogical capabilities: some exhibit strong teaching adaptivity, while others are more effective as question-answering systems than as tutors. The proposed framework improves the objectivity, reproducibility, and pedagogical relevance of AI tutor evaluation.
📝 Abstract
In this paper, we investigate whether current state-of-the-art large language models (LLMs) are effective as AI tutors and whether they demonstrate the pedagogical abilities necessary for good AI tutoring in educational dialogues. Previous evaluation efforts have been limited to subjective protocols and benchmarks. To bridge this gap, we propose a unified evaluation taxonomy with eight pedagogical dimensions based on key learning sciences principles, designed to assess the pedagogical value of LLM-powered AI tutor responses grounded in student mistakes or confusions in the mathematical domain. We release MRBench, a new evaluation benchmark containing 192 conversations and 1,596 responses from seven state-of-the-art LLM-based and human tutors, with gold annotations for the eight pedagogical dimensions. We assess the reliability of the popular Prometheus2 and Llama-3.1-8B LLMs as evaluators and analyze each tutor's pedagogical abilities, highlighting which LLMs make good tutors and which are better suited as question-answering systems. We believe the presented taxonomy, benchmark, and human-annotated labels will streamline the evaluation process and help track progress in the development of AI tutors.
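To make the evaluator-reliability step more concrete, the sketch below shows one way such a check could be operationalized: comparing an LLM judge's per-dimension labels against human gold annotations with Cohen's kappa. The dimension names, label values, and data layout here are illustrative assumptions, not the benchmark's actual schema or the paper's exact procedure.

```python
# Hypothetical sketch: per-dimension agreement between an LLM evaluator
# (e.g., Prometheus2 or Llama-3.1-8B used as a judge) and human gold labels.
# Dimension names, label scales, and the input format are illustrative
# assumptions rather than the benchmark's real schema.
from sklearn.metrics import cohen_kappa_score

# Placeholder dimension names (the framework defines eight).
DIMENSIONS = ["mistake_identification", "providing_guidance"]

def per_dimension_kappa(human_labels, llm_labels):
    """Compute Cohen's kappa for each dimension.

    human_labels / llm_labels: dicts mapping a dimension name to a list of
    categorical labels, one per tutor response, in the same order.
    """
    return {
        dim: cohen_kappa_score(human_labels[dim], llm_labels[dim])
        for dim in DIMENSIONS
    }

if __name__ == "__main__":
    # Toy labels for a handful of tutor responses (illustrative only).
    human = {
        "mistake_identification": ["yes", "yes", "no", "to_some_extent"],
        "providing_guidance":     ["yes", "no", "no", "yes"],
    }
    llm = {
        "mistake_identification": ["yes", "no", "no", "to_some_extent"],
        "providing_guidance":     ["yes", "no", "yes", "yes"],
    }
    for dim, kappa in per_dimension_kappa(human, llm).items():
        print(f"{dim}: kappa = {kappa:.2f}")
```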