AI Summary
This study addresses the core problem of dynamically optimizing post-error feedback strategies to enhance learning outcomes in a large-scale online tutoring system serving millions of students. We propose a reinforcement learning framework integrating multi-armed bandits and contextual bandits, enabling real-time, adaptive selection of feedback strategies across 43,000 distinct prompt variants and 166,000 practice sessions, the first such deployment at this scale. We introduce an adaptive policy-target selection algorithm and leverage causal inference to quantify the marginal gain of personalized feedback, revealing that globally optimized policies often match or approach the performance of fully personalized ones. Experiments demonstrate statistically significant improvements in immediate response accuracy and session completion rates. The system has been deployed in production, serving thousands of students daily, thereby advancing the large-scale, data-driven implementation of pedagogical strategies.
Abstract
We present an online tutoring system that learns to provide effective feedback to students after they answer questions incorrectly. Using data from one million students, the system learns which assistance action (e.g., one of multiple hints) to provide for each question to optimize student learning. Employing the multi-armed bandit (MAB) framework and offline policy evaluation, we assess 43,000 assistance actions and identify trade-offs between assistance policies optimized for different student outcomes (e.g., response correctness, session completion). We design an algorithm that, for each question, selects a suitable policy training objective to enhance students' immediate second-attempt success and overall practice session performance. We evaluate the resulting MAB policies in 166,000 practice sessions, verifying significant improvements in student outcomes. While MAB policies optimize feedback for the overall student population, we further investigate whether contextual bandit (CB) policies can improve outcomes by personalizing feedback based on individual student features (e.g., ability estimates, response times). Using causal inference, we examine (i) how the effects of assistance actions vary across students and (ii) whether CB policies, which leverage such effect heterogeneity, outperform MAB policies. While our analysis reveals effect heterogeneity for some actions on some questions, effect sizes may often be too small for CB policies to provide significant improvements beyond what well-optimized MAB policies, which deliver the same action to all students, already achieve. We discuss insights gained from deploying data-driven systems at scale and implications for future refinements. Today, the teaching policies optimized by our system support thousands of students daily.
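To make the MAB setup concrete: for each question, the policy maintains a reward estimate per assistance action (e.g., per hint) and balances exploring under-tried actions with exploiting the best-looking one. The sketch below uses Thompson sampling with Beta posteriors over a binary reward such as second-attempt correctness. This is a minimal illustration under assumed names (`ThompsonSamplingMAB`, the reward definition), not the paper's actual training or offline-evaluation pipeline.

```python
import random


class ThompsonSamplingMAB:
    """Minimal Thompson-sampling bandit over one question's assistance actions.

    Each arm (e.g., a specific hint) keeps a Beta(successes + 1, failures + 1)
    posterior over the probability of a binary reward, here taken to be
    whether the student's second attempt is correct (an assumption for
    illustration; the paper also considers outcomes like session completion).
    """

    def __init__(self, n_actions: int) -> None:
        self.successes = [0] * n_actions
        self.failures = [0] * n_actions

    def select_action(self, rng=random) -> int:
        # Draw one sample from each arm's posterior and pick the arm
        # whose sampled success probability is highest.
        samples = [
            rng.betavariate(s + 1, f + 1)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, action: int, reward: int) -> None:
        # reward: 1 if the observed outcome was a success, else 0.
        if reward:
            self.successes[action] += 1
        else:
            self.failures[action] += 1


if __name__ == "__main__":
    # Simulate students answering one question with three hypothetical
    # assistance actions whose true success rates differ.
    rng = random.Random(0)
    true_p = [0.3, 0.6, 0.4]
    bandit = ThompsonSamplingMAB(len(true_p))
    counts = [0, 0, 0]
    for _ in range(2000):
        a = bandit.select_action(rng)
        counts[a] += 1
        bandit.update(a, 1 if rng.random() < true_p[a] else 0)
    print(counts)  # the best action (index 1) should dominate
```

A fully MAB (non-contextual) policy like this ignores student features; the CB extension discussed in the abstract would condition `select_action` on a feature vector (e.g., ability estimate, response time) to exploit effect heterogeneity.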