🤖 AI Summary
Large language models (LLMs) deployed in education often provide direct answers, violating the pedagogical principle of scaffolded instruction.
Method: We propose a teaching-aligned framework that trains LLMs via online reinforcement learning to simulate teacher-student interaction, replacing answer-output with guided problem-solving.
Contribution/Results: (1) We introduce the first controllable multi-objective reward mechanism that explicitly models the Pareto trade-off between instructional support and solution accuracy; (2) We achieve efficient distillation of a 7B model into a pedagogically capable assistant using only synthetic data—no human annotations required; (3) The distilled model retains strong reasoning capabilities and supports interpretable, chain-of-thought–annotated instructional planning. Experiments show it matches commercial models (e.g., LearnLM) and significantly outperforms single-turn supervised fine-tuning baselines, establishing new state-of-the-art performance in both guidance quality and reasoning preservation.
📝 Abstract
Large language models (LLMs) can transform education, but their optimization for direct question-answering often undermines effective pedagogy which requires strategically withholding answers. To mitigate this, we propose an online reinforcement learning (RL)-based alignment framework that can quickly adapt LLMs into effective tutors using simulated student-tutor interactions by emphasizing pedagogical quality and guided problem-solving over simply giving away answers. We use our method to train a 7B parameter tutor model without human annotations which reaches similar performance to larger proprietary models like LearnLM. We introduce a controllable reward weighting to balance pedagogical support and student solving accuracy, allowing us to trace the Pareto frontier between these two objectives. Our models better preserve reasoning capabilities than single-turn SFT baselines and can optionally enhance interpretability through thinking tags that expose the model's instructional planning.