🤖 AI Summary
Existing LLM-based AI tutors prioritize linguistic quality over directly optimizing student learning outcomes, such as answer accuracy, which limits their pedagogical effectiveness.
Method: We propose a teaching-effectiveness–centric paradigm: (1) building an LLM-based student model to predict the likelihood of correct student responses; (2) generating high-quality preference data via GPT-4o–guided scoring against pedagogical principles; and (3) fine-tuning Llama 3.1 8B with direct preference optimization (DPO).
Contribution/Results: This is the first work to jointly model student learning outcomes and pedagogical fidelity in an end-to-end trainable framework. Experiments demonstrate statistically significant improvements in student answer accuracy, with teaching performance on par with GPT-4o. Human evaluation confirms that our method produces responses exhibiting both high instructional effectiveness and naturalness, outperforming baselines on both objective and subjective metrics.
📝 Abstract
Generative artificial intelligence (AI) has the potential to scale up personalized tutoring through large language models (LLMs). Recent AI tutors are adapted for the tutoring task by training or prompting LLMs to follow effective pedagogical principles, though they are not trained to maximize student learning throughout the course of a dialogue. Therefore, they may engage with students in a suboptimal way. We address this limitation by introducing an approach to train LLMs to generate tutor utterances that maximize the likelihood of student correctness, while still encouraging the model to follow good pedagogical practice. Specifically, we generate a set of candidate tutor utterances and score them using (1) an LLM-based student model to predict the chance of correct student responses and (2) a pedagogical rubric evaluated by GPT-4o. We then use the resulting data to train an open-source LLM, Llama 3.1 8B, using direct preference optimization. We show that tutor utterances generated by our model lead to significantly higher chances of correct student responses while maintaining the pedagogical quality of GPT-4o. We also conduct qualitative analyses and a human evaluation to demonstrate that our model generates high quality tutor utterances.
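The candidate-scoring and preference-data step described in the abstract can be sketched as follows. This is a hedged illustration, not the paper's implementation: `student_correctness_prob` and `pedagogy_score` are toy placeholders standing in for the LLM-based student model and the GPT-4o rubric, respectively, and the combined ranking rule is an assumption for demonstration.

```python
# Sketch of building a (chosen, rejected) preference pair for DPO from a set
# of candidate tutor utterances, under the two scoring signals the abstract
# describes. Both scorers below are hypothetical placeholders; in the paper
# they are an LLM-based student model and a GPT-4o pedagogical rubric.

def student_correctness_prob(utterance: str) -> float:
    """Placeholder for the student model's predicted probability that the
    student answers correctly after this tutor utterance."""
    # Toy heuristic for illustration only: longer guidance scores higher.
    return min(1.0, len(utterance) / 100)


def pedagogy_score(utterance: str) -> float:
    """Placeholder for a rubric score (e.g., rewards guiding questions
    rather than giving away the answer)."""
    return 1.0 if "?" in utterance else 0.5


def build_preference_pair(candidates: list[str]) -> tuple[str, str]:
    """Rank candidates by a combined score and return the best and worst
    utterances as the (chosen, rejected) pair used for DPO training."""
    ranked = sorted(
        candidates,
        key=lambda u: student_correctness_prob(u) + pedagogy_score(u),
        reverse=True,
    )
    return ranked[0], ranked[-1]


candidates = [
    "The answer is 42.",
    "What happens if you substitute x = 3 into the equation and simplify?",
    "Look again at step two.",
]
chosen, rejected = build_preference_pair(candidates)
```

In the actual pipeline, many such pairs would be collected across tutoring dialogues and passed to a DPO trainer to fine-tune the open-source tutor model.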