PEARL: Training Socratic Tutors with Pedagogically Aligned Reinforcement Learning

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses why current large language models struggle to deliver Socratic tutoring in educational settings: student simulation is low-fidelity and weakly controllable, pedagogical reward modeling is under-specified, and multi-objective optimization is unstable. The study proposes three components: a controllable student simulator that decouples latent cognitive states from response generation, a generative reward model that jointly evaluates pedagogical quality and answer correctness, and a multi-objective reinforcement learning scheme that discretizes rewards within each dimension and aggregates normalized advantages across dimensions. The resulting tutor, built on a 30B-parameter policy model, achieves the best performance among open-source models on multiple benchmarks and remains competitive with leading proprietary large language models.

📝 Abstract

Large Language Models (LLMs) have shown promise as educational tutors, yet effective tutoring requires more than solving problems: a tutor must provide progressive Socratic guidance and balance multiple pedagogical objectives across multi-turn interactions. However, training such tutors remains challenging due to limited-fidelity and weakly controllable student simulation, under-specified pedagogical reward modeling, and unstable multi-objective optimization. To overcome these limitations, we propose PEARL, a pedagogically aligned reinforcement learning framework for training Socratic tutoring agents, consisting of three key components. First, we introduce a controllable student simulator that decouples latent cognitive states from response generation to model diverse abilities and misconceptions. Second, we develop a generative reward model that jointly evaluates pedagogical quality and objective correctness for policy optimization. Finally, we propose a stable multi-objective RL scheme that discretizes rewards within each dimension and aggregates normalized advantages across dimensions, preventing high-variance objectives from dominating updates. Experiments on multiple benchmarks show that PEARL achieves the best performance among open-source models and remains competitive with leading proprietary LLMs, despite using only a 30B policy model.
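
The abstract's third component (discretize rewards within each dimension, then aggregate normalized advantages across dimensions) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the five-level discretization, the assumption that raw rewards lie in [0, 1], the group-wise mean/standard-deviation normalization, and the equal default weights are all hypothetical choices made here for illustration.

```python
from statistics import mean, pstdev


def discretize(value, num_bins=5):
    """Snap a raw reward (assumed to lie in [0, 1]) to one of num_bins even levels."""
    clipped = min(max(value, 0.0), 1.0)
    level = round(clipped * (num_bins - 1))
    return level / (num_bins - 1)


def normalized_advantages(rewards, eps=1e-6):
    """Center and scale one dimension's rewards across a group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def aggregate_advantages(reward_table, weights=None, num_bins=5):
    """Combine per-dimension advantages into one advantage per rollout.

    reward_table maps a dimension name to a list of raw rewards,
    one entry per rollout sampled for the same prompt (hypothetical layout).
    """
    dims = list(reward_table)
    if weights is None:
        # Assumption: equal weighting when no weights are supplied.
        weights = {d: 1.0 / len(dims) for d in dims}
    num_rollouts = len(reward_table[dims[0]])
    total = [0.0] * num_rollouts
    for d in dims:
        levels = [discretize(r, num_bins) for r in reward_table[d]]
        # Normalizing per dimension puts every objective on the same scale
        # before summing, so no single dimension's variance dominates.
        for i, adv in enumerate(normalized_advantages(levels)):
            total[i] += weights[d] * adv
    return total


if __name__ == "__main__":
    rewards = {
        "pedagogy": [0.1, 0.4, 0.9, 0.6],
        "correctness": [0.0, 1.0, 1.0, 0.0],
    }
    print(aggregate_advantages(rewards))
```

Because each dimension is normalized before summation, a dimension with a wide raw reward spread contributes on the same scale as a narrow one, which is the intuition behind keeping high-variance objectives from dominating the update.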

Problem

Research questions and friction points this paper is trying to address.

Socratic tutoring

pedagogical objectives

student simulation

multi-objective reinforcement learning

reward modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Socratic tutoring

pedagogically aligned reinforcement learning

controllable student simulator