🤖 AI Summary
This work addresses the limitations of existing evaluation methods, which fail to comprehensively assess the multi-turn interactive and pedagogical capabilities of large language models (LLMs) in K–8 mathematics instruction. To this end, we propose KMP-Bench, a novel benchmark introducing the first multi-turn dialogue evaluation framework grounded in six core teaching principles, together with KMP-Pile, a large-scale instructional dialogue dataset. The benchmark combines multi-turn dialogue construction, teaching-principle-aligned evaluation, error detection and correction, and problem generation to systematically probe LLMs' instructional competence. Experimental results show that while mainstream LLMs perform well on verifiable tasks, they exhibit significant deficiencies in applying established teaching principles; fine-tuning on KMP-Pile, however, substantially improves their pedagogical performance on KMP-Bench.
📝 Abstract
Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching effectiveness. In this paper, we introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to assess LLMs from two complementary perspectives. The first module, KMP-Dialogue, evaluates holistic pedagogical capabilities against six core principles (e.g., Challenge, Explanation, Feedback), leveraging a novel multi-turn dialogue dataset constructed by weaving together diverse pedagogical components. The second module, KMP-Skills, provides a granular assessment of foundational tutoring abilities, including multi-turn problem-solving, error detection and correction, and problem generation. Our evaluations on KMP-Bench reveal a key disparity: while leading LLMs excel at tasks with verifiable solutions, they struggle with the nuanced application of pedagogical principles. Additionally, we present KMP-Pile, a large-scale dialogue dataset of 150K instances. Models fine-tuned on KMP-Pile show substantial improvement on KMP-Bench, underscoring the value of pedagogically rich training data for developing more effective AI math tutors.