AI Summary
This work proposes PedagogicalRL-Thinking, a novel framework that integrates educational theory into the internal reasoning mechanisms of large language models (LLMs), addressing the common oversight in existing approaches that prioritize output correctness over pedagogically sound reasoning processes. By incorporating instruction-guided reasoning prompts and a reinforcement learning-based reward mechanism tailored to teaching principles, the framework jointly fine-tunes LLMs to optimize their reasoning trajectories. Evaluated on math tutoring tasks, the resulting models not only achieve significant performance gains on unseen educational benchmarks but also retain their original factual knowledge while generating reasoning steps that exhibit greater pedagogical structure and logical coherence.
Abstract
Large language models (LLMs) are increasingly deployed as intelligent tutoring systems, yet research on optimizing LLMs specifically for educational contexts remains limited. Recent works have proposed reinforcement learning approaches for training LLM tutors, but these methods focus solely on optimizing visible responses while neglecting the model's internal thinking process. We introduce PedagogicalRL-Thinking, a framework that extends pedagogical alignment to reasoning LLMs in education through two novel approaches: (1) Pedagogical Reasoning Prompting, which guides internal reasoning using domain-specific educational theory rather than generic instructions; and (2) Thinking Reward, which explicitly evaluates and reinforces the pedagogical quality of the model's reasoning traces. Our experiments reveal that domain-specific, theory-grounded prompting outperforms generic prompting, and that Thinking Reward is most effective when combined with pedagogical prompting. Furthermore, models trained only on mathematics tutoring dialogues show improved performance on educational benchmarks not seen during training, while preserving the base model's factual knowledge. Our quantitative and qualitative analyses reveal that pedagogical thinking reward produces systematic reasoning trace changes, with increased pedagogical reasoning and more structured instructional decision-making in the tutor's thinking process.