🤖 AI Summary
This study addresses a critical gap in the evaluation of AI teaching assistants, which has predominantly emphasized problem-solving accuracy or generic safety while overlooking pedagogical harms such as answer over-disclosure, reinforcement of misconceptions, and missing scaffolding during instruction. To bridge this gap, the work proposes SafeTutors, the first multi-turn interactive evaluation framework that jointly measures instructional efficacy and safety. Grounded in learning-science theory, it introduces a taxonomy of instructional safety comprising 11 harm dimensions and 48 sub-risks. Benchmarking across mathematics, physics, and chemistry reveals that all evaluated models exhibit significant instructional harms, with failure rates escalating from 17.7% in single-turn to 77.8% in multi-turn interactions. Harm patterns also vary across disciplines, and neither larger model scale nor single-turn assessment is sufficient to detect these systemic instructional failures.
📝 Abstract
Large language models are rapidly being deployed as AI tutors, yet current evaluation paradigms assess problem-solving accuracy and generic safety in isolation, failing to capture whether a model is simultaneously pedagogically effective and safe over the course of student-tutor interactions. We argue that tutoring safety is fundamentally different from conventional LLM safety: the primary risk is not toxic content but the quiet erosion of learning through answer over-disclosure, misconception reinforcement, and the abdication of scaffolding. To systematically study this failure mode, we introduce SafeTutors, a benchmark that jointly evaluates safety and pedagogy across mathematics, physics, and chemistry. SafeTutors is organized around a theoretically grounded risk taxonomy comprising 11 harm dimensions and 48 sub-risks drawn from the learning-science literature. We find that all models exhibit broad harm, that scale does not reliably help, and that multi-turn dialogue worsens behavior, with pedagogical failure rates rising from 17.7% to 77.8%. Harms also vary by subject, meaning mitigations must be discipline-aware and single-turn "safe/helpful" results can mask systematic tutoring failure over extended interactions.
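To make the multi-turn protocol concrete, here is a minimal sketch of how such an evaluation loop could be wired up. Everything in it is an assumption for illustration: the harm labels, the `evaluate_dialogue` and `failure_rate` helpers, and the stub tutor/student/judge agents are hypothetical, not the paper's actual implementation, and the paper's full taxonomy of 11 dimensions and 48 sub-risks is not reproduced here.

```python
import random
from dataclasses import dataclass, field

# Illustrative harm labels only -- the paper's taxonomy spans 11 harm
# dimensions and 48 sub-risks, which are not reproduced here.
HARM_DIMENSIONS = (
    "answer_over_disclosure",
    "misconception_reinforcement",
    "scaffolding_abdication",
)

@dataclass
class DialogueResult:
    turns: int = 0
    flagged: set = field(default_factory=set)  # harms seen in any turn

def evaluate_dialogue(tutor, student, judge, max_turns=8):
    """Run one simulated student-tutor dialogue, judging every tutor turn.

    tutor/student are callables mapping message history -> next message;
    judge maps (reply, history) -> set of flagged harm labels.
    """
    history = [student([])]  # student opens with a problem statement
    result = DialogueResult()
    for _ in range(max_turns):
        reply = tutor(history)
        result.flagged |= judge(reply, history)
        history += [reply, student(history + [reply])]
        result.turns += 1
    return result

def failure_rate(results):
    """Fraction of dialogues with at least one flagged harm; a single
    bad turn fails the whole dialogue, so multi-turn rates can far
    exceed single-turn ones."""
    return sum(bool(r.flagged) for r in results) / len(results)

if __name__ == "__main__":
    rng = random.Random(0)
    # Stub agents purely to exercise the loop: each tutor turn leaks
    # the full answer with 20% probability.
    tutor = lambda hist: "full answer" if rng.random() < 0.2 else "hint"
    student = lambda hist: "I'm stuck -- can you help?"
    judge = lambda reply, hist: (
        {HARM_DIMENSIONS[0]} if reply == "full answer" else set()
    )
    runs = [evaluate_dialogue(tutor, student, judge) for _ in range(500)]
    print(f"dialogue-level failure rate: {failure_rate(runs):.1%}")
```

The stub demo also illustrates why multi-turn failure rates can dwarf single-turn ones: if a dialogue fails when any turn fails, even a modest per-turn failure probability compounds across turns (here a 20% per-turn leak rate over 8 turns yields roughly 1 − 0.8⁸ ≈ 83% per dialogue), which is one plausible mechanism behind the 17.7% to 77.8% escalation the paper reports.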