🤖 AI Summary
Existing RLHF methods for LLM-based math tutoring optimize only single-turn responses, failing to align with long-term pedagogical objectives—such as fostering students’ independent problem-solving ability. To address this, we propose a latent-state-based long-horizon policy optimization framework. Our approach encodes multi-turn dialogue history into a low-dimensional latent representation of student state and directly generates high-level pedagogical actions (e.g., “scaffold,” “probe,” “verify”)—bypassing fine-grained language modeling and substantially reducing computational overhead. Crucially, we decouple teaching strategy learning into two modular components: state perception and action planning, enabling explicit optimization toward long-term learning goals. Experiments on simulated tutoring tasks demonstrate that our method significantly improves students’ final problem-solving performance, outperforming both conventional prompt engineering and single-turn RLHF baselines.
📝 Abstract
Large language models (LLMs) built on existing reinforcement learning with human feedback (RLHF) frameworks typically optimize responses based on immediate turn-level human preferences. However, this approach falls short in multi-turn dialogue settings, such as online math tutoring. We propose a method to enhance LLM-based tutors by representing the dialogue history with a low-dimensional latent representation of the student's state and optimizing a long-horizon policy that selects high-level actions based on that latent state. The goal is to better align the tutor's behavior with the long-term objective of guiding the student toward solving a target math problem on their own. Our model is lightweight, requiring fewer computational resources than prior work that trains the tutor policy end-to-end to directly output the tutor's next utterance. Our experimental results demonstrate that these modifications lead to improved long-term outcomes compared to prompting in LLM-simulated tutoring tasks.
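To make the decoupled architecture concrete, here is a minimal illustrative sketch (not the paper's code) of the two modules described above: a state-perception module that compresses the dialogue history into a low-dimensional latent student state, and an action planner that maps that state to a high-level pedagogical action. All class names, dimensions, and the mean-pooling encoder are hypothetical stand-ins for the learned components.

```python
import numpy as np

# High-level pedagogical actions, as in the summary above.
ACTIONS = ["scaffold", "probe", "verify"]

class StatePerception:
    """Encodes a multi-turn dialogue history into a low-dimensional latent student state.

    A learned encoder would be used in practice; mean-pooled random
    embeddings stand in here to keep the sketch self-contained.
    """
    def __init__(self, vocab_size=1000, latent_dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(vocab_size, latent_dim))

    def encode(self, token_ids):
        # Mean-pool token embeddings of the dialogue history.
        return self.W[token_ids].mean(axis=0)

class ActionPlanner:
    """Maps the latent student state to a distribution over high-level actions.

    This is the component that would be optimized with a long-horizon RL
    objective; a random linear head stands in here.
    """
    def __init__(self, latent_dim=8, n_actions=len(ACTIONS), seed=1):
        rng = np.random.default_rng(seed)
        self.V = rng.normal(scale=0.1, size=(latent_dim, n_actions))

    def act(self, z):
        logits = z @ self.V
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return ACTIONS[int(np.argmax(probs))], probs

# Usage: the planner picks a high-level action from the latent state; a
# separate LLM would then realize that action as the tutor's next utterance.
perception = StatePerception()
planner = ActionPlanner()
z = perception.encode([3, 17, 42, 99])  # token ids of the dialogue history
action, probs = planner.act(z)
```

Because only the small planner (not the full LLM) is trained against the long-term objective, this decomposition is what makes the approach lightweight relative to end-to-end utterance-level policy optimization.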