From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench

📅 2026-03-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of existing evaluation methods, which fail to comprehensively assess the multi-turn, interactive, and pedagogical capabilities of large language models (LLMs) in K–8 mathematics instruction. To this end, we propose KMP-Bench, the first multi-turn dialogue evaluation framework grounded in six core teaching principles, along with KMP-Pile, a large-scale instructional dialogue dataset. The benchmark combines multi-turn dialogue construction and teaching-principle-aligned evaluation with granular skill tests covering error detection and correction and problem generation; fine-tuning on KMP-Pile is then used to strengthen instructional competence. Experimental results show that while mainstream LLMs perform well on verifiable tasks, they exhibit significant deficiencies in applying established teaching principles; fine-tuning on KMP-Pile substantially improves their pedagogical performance on KMP-Bench.
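To make the benchmark's dialogue-centric setup concrete, here is a minimal sketch of what a multi-turn evaluation item might look like. The paper does not publish its data schema in this summary, so every field name below is an assumption, and only three of the six teaching principles (Challenge, Explanation, Feedback) are named in the abstract.

```python
# Hypothetical record format for a KMP-Dialogue item.
# All field names are illustrative; the paper's actual schema is not shown here.
from dataclasses import dataclass, field

# Three of the six core principles are named in the abstract;
# the remaining three are not listed in this summary.
PRINCIPLES = ["Challenge", "Explanation", "Feedback"]

@dataclass
class Turn:
    role: str   # "student" or "tutor"
    text: str

@dataclass
class DialogueItem:
    problem: str                                          # the K-8 math problem under discussion
    turns: list[Turn] = field(default_factory=list)       # the multi-turn exchange
    target_principles: list[str] = field(default_factory=list)

# Example item: a tutor giving feedback and posing a follow-up challenge.
item = DialogueItem(
    problem="A baker has 24 cupcakes and puts 6 in each box. How many boxes?",
    turns=[
        Turn("student", "I think it is 4 because 24 - 6 - 6 - 6 - 6 = 0."),
        Turn("tutor", "Nice reasoning! Which operation does repeated subtraction match?"),
    ],
    target_principles=["Feedback", "Challenge"],
)
```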

📝 Abstract
Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching effectiveness. In this paper, we introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to assess LLMs from two complementary perspectives. The first module, KMP-Dialogue, evaluates holistic pedagogical capabilities against six core principles (e.g., Challenge, Explanation, Feedback), leveraging a novel multi-turn dialogue dataset constructed by weaving together diverse pedagogical components. The second module, KMP-Skills, provides a granular assessment of foundational tutoring abilities, including multi-turn problem-solving, error detection and correction, and problem generation. Our evaluations on KMP-Bench reveal a key disparity: while leading LLMs excel at tasks with verifiable solutions, they struggle with the nuanced application of pedagogical principles. Additionally, we present KMP-Pile, a large-scale (150K) dialogue dataset. Models fine-tuned on KMP-Pile show substantial improvement on KMP-Bench, underscoring the value of pedagogically rich training data for developing more effective AI math tutors.
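Since KMP-Dialogue scores dialogues against teaching principles rather than against a single verifiable answer, an LLM-as-judge loop is a natural way to picture the evaluation. The sketch below is an assumption, not the paper's method: `call_llm` is a placeholder for whatever chat-completion client you use, and the rubric wording is invented for illustration.

```python
# Minimal sketch of principle-aligned scoring with an LLM judge.
# The rubric text and the 1-5 scale are illustrative assumptions,
# not taken from the paper.

def call_llm(prompt: str) -> str:
    """Placeholder: plug in your chat-completion client here."""
    raise NotImplementedError("connect an LLM client")

RUBRIC = (
    "You are grading a tutoring dialogue for K-8 math.\n"
    "Principle: {principle}\n"
    "Dialogue:\n{dialogue}\n"
    "Return a single integer from 1 (violates the principle) "
    "to 5 (exemplifies the principle)."
)

def score_dialogue(dialogue: str, principles: list[str]) -> dict[str, int]:
    """Score one rendered dialogue transcript on each target principle."""
    scores: dict[str, int] = {}
    for principle in principles:
        reply = call_llm(RUBRIC.format(principle=principle, dialogue=dialogue))
        scores[principle] = int(reply.strip())
    return scores
```

Scoring each principle in a separate judge call keeps the rubric focused and makes per-principle deficiencies (the disparity the paper reports) directly visible in the output.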
Problem

Research questions and friction points this paper is trying to address.

pedagogical evaluation, large language models, mathematical tutoring, multi-turn dialogue, teaching effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

pedagogical intelligence, KMP-Bench, multi-turn dialogue evaluation, mathematical tutoring, KMP-Pile
Authors

Weikang Shi
Multimedia Laboratory (MMLab), The Chinese University of Hong Kong

Houxing Ren
Beihang University

Junting Pan
Multimedia Laboratory (MMLab), The Chinese University of Hong Kong

Aojun Zhou
The Chinese University of Hong Kong

Ke Wang
The Chinese University of Hong Kong

Zimu Lu
Ph.D. student at The Chinese University of Hong Kong

Yunqiao Yang
City University of Hong Kong

Yuxuan Hu
The Chinese University of Hong Kong

Linda Wei
Multimedia Laboratory (MMLab), The Chinese University of Hong Kong

Mingjie Zhan
Multimedia Laboratory (MMLab), The Chinese University of Hong Kong

Hongsheng Li
Multimedia Laboratory (MMLab), The Chinese University of Hong Kong