🤖 AI Summary
This study investigates whether mainstream large language models (LLMs) can effectively serve as AI tutors, specifically in mathematics education contexts involving error-based student–tutor dialogues.
Method: We propose the first pedagogically grounded evaluation framework for teaching competence, encompassing eight theoretically informed dimensions rooted in the learning sciences. We introduce MRBench, the first multidimensional, human-annotated benchmark for educational dialogue (192 multi-turn dialogues, 1,596 responses), which enables assessment that is comparable across dimensions and between human and LLM tutors. Our methodology integrates educational dialogue analysis, fine-grained human annotation, LLM-based meta-evaluation (using Prometheus2 and Llama-3.1-8B), and statistical reliability testing.
Contribution/Results: Empirical results reveal substantial divergence in LLMs' pedagogical capabilities: some exhibit strong teaching adaptivity, while others are more effective as question-answering systems than as tutors. The proposed framework improves the objectivity, reproducibility, and pedagogical relevance of AI tutor evaluation.
📝 Abstract
In this paper, we investigate whether current state-of-the-art large language models (LLMs) are effective as AI tutors and whether they demonstrate the pedagogical abilities necessary for good AI tutoring in educational dialogues. Previous evaluation efforts have been limited to subjective protocols and benchmarks. To bridge this gap, we propose a unified evaluation taxonomy with eight pedagogical dimensions based on key learning sciences principles, designed to assess the pedagogical value of LLM-powered AI tutor responses grounded in student mistakes or confusions in the mathematical domain. We release MRBench, a new evaluation benchmark containing 192 conversations and 1,596 responses from seven state-of-the-art LLM-based and human tutors, with gold annotations for the eight pedagogical dimensions. We assess the reliability of the popular Prometheus2 and Llama-3.1-8B LLMs as evaluators and analyze each tutor's pedagogical abilities, highlighting which LLMs make good tutors and which are better suited as question-answering systems. We believe the presented taxonomy, benchmark, and human-annotated labels will streamline the evaluation process and help track progress in the development of AI tutors.
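To make the evaluator-reliability step more concrete, the sketch below shows one way such a check could be operationalized: comparing an LLM judge's per-dimension labels against human gold annotations with Cohen's kappa. The dimension names, label values, and data layout here are illustrative assumptions, not the benchmark's actual schema or the paper's exact procedure.

```python
# Hypothetical sketch: per-dimension agreement between an LLM evaluator
# (e.g., Prometheus2 or Llama-3.1-8B used as a judge) and human gold labels.
# Dimension names, label scales, and the input format are illustrative
# assumptions rather than the benchmark's real schema.
from sklearn.metrics import cohen_kappa_score

# Placeholder dimension names (the framework defines eight).
DIMENSIONS = ["mistake_identification", "providing_guidance"]

def per_dimension_kappa(human_labels, llm_labels):
    """Compute Cohen's kappa for each dimension.

    human_labels / llm_labels: dicts mapping a dimension name to a list of
    categorical labels, one per tutor response, in the same order.
    """
    return {
        dim: cohen_kappa_score(human_labels[dim], llm_labels[dim])
        for dim in DIMENSIONS
    }

if __name__ == "__main__":
    # Toy labels for a handful of tutor responses (illustrative only).
    human = {
        "mistake_identification": ["yes", "yes", "no", "to_some_extent"],
        "providing_guidance":     ["yes", "no", "no", "yes"],
    }
    llm = {
        "mistake_identification": ["yes", "no", "no", "to_some_extent"],
        "providing_guidance":     ["yes", "no", "yes", "yes"],
    }
    for dim, kappa in per_dimension_kappa(human, llm).items():
        print(f"{dim}: kappa = {kappa:.2f}")
```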