🤖 AI Summary
Large reasoning models (LRMs) exhibit weak pedagogical coherence, deficient knowledge-transmission logic, and inadequate simulation of teacher behaviors in educational settings.
Method: This paper proposes a teaching-aligned distillation fine-tuning paradigm: (1) constructing WBEB, the first multi-dimensional benchmark for evaluating educational capabilities; (2) designing Chain-of-Pedagogy (CoP), a structured prompting strategy that emulates instructional reasoning; and (3) integrating model distillation with instruction tuning to explicitly model pedagogical behaviors. The approach combines quantitative evaluation and qualitative analysis across five core educational tasks.
Results: Experiments demonstrate significant improvements in teaching consistency, decision traceability, and reasoning plausibility. For the first time, this work systematically characterizes the strengths and critical limitations of LRMs in pedagogical competence, establishing both theoretical foundations and actionable technical pathways for trustworthy adaptation of large models to education.
📄 Abstract
Recent advances in large reasoning models (LRMs) show strong performance in structured domains such as mathematics and programming; however, they often lack pedagogical coherence and realistic teaching behaviors. To bridge this gap, we introduce Pedagogy-R1, a framework that adapts LRMs for classroom use through three innovations: (1) a distillation-based pipeline that filters and refines model outputs for instruction tuning, (2) the Well-balanced Educational Benchmark (WBEB), which evaluates performance across subject knowledge, pedagogical knowledge, knowledge tracing, essay scoring, and teacher decision-making, and (3) a Chain-of-Pedagogy (CoP) prompting strategy for generating and eliciting teacher-style reasoning. Our mixed-method evaluation combines quantitative metrics with qualitative analysis, providing the first systematic assessment of LRMs' pedagogical strengths and limitations.
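The abstract does not specify the CoP template itself, so as a rough illustration only, a teacher-style prompt of this kind might be assembled from a sequence of instructional stages. The stage names (`Diagnose`, `Plan`, `Explain`, `Check`) and all wording below are assumptions for the sketch, not the paper's actual prompt:

```python
# Hypothetical sketch of a Chain-of-Pedagogy-style prompt builder.
# The stage names and their descriptions are illustrative assumptions;
# the paper's actual CoP template may differ.
COP_STAGES = [
    ("Diagnose", "Identify what the student currently understands and "
                 "where the misconception lies."),
    ("Plan", "Choose an instructional move (hint, worked example, "
             "probing question) suited to the gap."),
    ("Explain", "Deliver the explanation step by step, in language "
                "matched to the student's level."),
    ("Check", "Pose a follow-up question to verify understanding."),
]

def build_cop_prompt(problem: str, student_answer: str) -> str:
    """Assemble a prompt that asks the model to reason through
    teacher-like stages before replying to the student."""
    stage_lines = "\n".join(
        f"{i + 1}. {name}: {desc}"
        for i, (name, desc) in enumerate(COP_STAGES)
    )
    return (
        "You are an experienced teacher. Before answering, reason "
        "through these stages:\n"
        f"{stage_lines}\n\n"
        f"Problem: {problem}\n"
        f"Student's answer: {student_answer}\n"
        "Write your reasoning for each stage, then your reply to the student."
    )

prompt = build_cop_prompt("What is 3/4 + 1/2?", "4/6")
```

The point of structuring the prompt this way is that the model's output can then be filtered stage by stage (e.g. discarding responses that skip the diagnosis step) before being used as distillation data for instruction tuning.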