The Missing Evaluation Axis: What 10,000 Student Submissions Reveal About AI Tutor Effectiveness

📅 2026-05-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a gap in the evaluation of AI tutoring systems, which has traditionally emphasized the pedagogical quality of feedback while overlooking how students actually use it. The authors propose an assessment framework that adds a behavioral dimension based on student interaction data, examining whether learners act on AI-generated feedback and whether they apply it correctly. Applied to 10,235 code submissions with corresponding tutor feedback from an introductory programming course, the framework shows that behavioral signals are more strongly associated with students’ perception of helpful feedback than pedagogical quality alone. Comparing two deployed AI tutors across different semesters also reveals substantial differences in engagement patterns that pedagogy-only evaluation does not capture.

📝 Abstract

Current Artificial Intelligence (AI)-based tutoring systems (AI tutors) are primarily evaluated based on the pedagogical quality of their feedback messages. While important, pedagogy alone is insufficient because it ignores a critical question: what do students actually do with the feedback they receive? We argue that AI tutor evaluation should be extended with a behavioral dimension grounded in student interaction data, which complements pedagogical assessment. We propose an evaluation framework and apply it to 10,235 code submissions with corresponding AI tutor feedback from an introductory undergraduate programming course to measure whether students act on tutor feedback and whether those actions are applied correctly. Using this framework to compare two deployed AI tutors across different semesters in a large-scale introductory computer science course reveals substantial differences in student engagement patterns that are not captured by pedagogy-only evaluation. Moreover, these engagement-based behavioral signals are more strongly associated with student perception of helpful feedback than pedagogical quality alone, providing a more complete and actionable picture of AI tutor performance.
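
The two behavioral questions in the abstract (do students act on feedback, and do they apply it correctly?) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `FeedbackEvent` schema, the field names, the `behavioral_rates` helper, and the toy data are hypothetical and are not taken from the paper, which may define and compute its measures differently.

```python
from dataclasses import dataclass


@dataclass
class FeedbackEvent:
    """One submission and the tutor feedback it received (hypothetical schema)."""
    student_id: str
    acted_on: bool           # assumed label: the next submission changed in response to the feedback
    applied_correctly: bool  # assumed label: the change resolved the issue the feedback pointed at


def behavioral_rates(events):
    """Return (adoption_rate, correct_application_rate).

    adoption_rate: share of all feedback events the student acted on.
    correct_application_rate: share of acted-on events that were applied correctly.
    Both fall back to 0.0 when their denominator is empty.
    """
    total = len(events)
    acted = [e for e in events if e.acted_on]
    adoption = len(acted) / total if total else 0.0
    correct = sum(e.applied_correctly for e in acted) / len(acted) if acted else 0.0
    return adoption, correct


# Toy data, invented purely to exercise the function.
events = [
    FeedbackEvent("s1", acted_on=True, applied_correctly=True),
    FeedbackEvent("s2", acted_on=True, applied_correctly=False),
    FeedbackEvent("s3", acted_on=False, applied_correctly=False),
    FeedbackEvent("s4", acted_on=True, applied_correctly=True),
]
adoption, correct = behavioral_rates(events)
print(f"adoption={adoption:.2f}, correct_application={correct:.2f}")
```

Keeping the two rates separate reflects the distinction the abstract draws: a tutor whose feedback is often acted on but rarely applied correctly would look different from one whose feedback is seldom acted on at all.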

Problem

Research questions and friction points this paper is trying to address.

AI tutor evaluation

student feedback utilization

behavioral dimension

pedagogical quality

student engagement

Innovation

Methods, ideas, or system contributions that make the work stand out.

behavioral evaluation

AI tutoring systems

student engagement