From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning

📅 2025-05-21

📈 Citations: 0

✨ Influential: 0

career value

202K/year

🤖 AI Summary

Large language models (LLMs) deployed in education often provide direct answers, violating the pedagogical principle of scaffolded instruction. Method: We propose a teaching-aligned framework that trains LLMs via online reinforcement learning to simulate teacher-student interaction, replacing answer-output with guided problem-solving. Contribution/Results: (1) We introduce the first controllable multi-objective reward mechanism that explicitly models the Pareto trade-off between instructional support and solution accuracy; (2) We achieve efficient distillation of a 7B model into a pedagogically capable assistant using only synthetic data—no human annotations required; (3) The distilled model retains strong reasoning capabilities and supports interpretable, chain-of-thought–annotated instructional planning. Experiments show it matches commercial models (e.g., LearnLM) and significantly outperforms single-turn supervised fine-tuning baselines, establishing new state-of-the-art performance in both guidance quality and reasoning preservation.

Technology Category

Application Category

📝 Abstract

Large language models (LLMs) can transform education, but their optimization for direct question-answering often undermines effective pedagogy which requires strategically withholding answers. To mitigate this, we propose an online reinforcement learning (RL)-based alignment framework that can quickly adapt LLMs into effective tutors using simulated student-tutor interactions by emphasizing pedagogical quality and guided problem-solving over simply giving away answers. We use our method to train a 7B parameter tutor model without human annotations which reaches similar performance to larger proprietary models like LearnLM. We introduce a controllable reward weighting to balance pedagogical support and student solving accuracy, allowing us to trace the Pareto frontier between these two objectives. Our models better preserve reasoning capabilities than single-turn SFT baselines and can optionally enhance interpretability through thinking tags that expose the model's instructional planning.

Problem

Research questions and friction points this paper is trying to address.

Aligning LLMs with pedagogy using reinforcement learning

Balancing pedagogical support and student solving accuracy

Enhancing interpretability through instructional planning tags

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning aligns LLMs with pedagogy

Simulated student-tutor interactions train tutor models

Controllable reward balances pedagogy and accuracy

🔎 Similar Papers

Towards the Pedagogical Steering of Large Language Models for Tutoring: A Case Study with Modeling Productive Failure