🤖 AI Summary
This study addresses the lack of systematic, interpretable evaluation of the instructional quality of AI teaching assistants. We propose the first standardized assessment framework that jointly evaluates pedagogical effectiveness and model interpretability. Methodologically, we develop an open-source, language-technology-driven evaluation toolkit that integrates NLP-based analysis, model attribution techniques, interactive visualization, and user feedback annotation, enabling multi-scenario evaluation of AI tutors. Our contributions are threefold: (1) the first integration of educational validity metrics with explainable AI (XAI) methods, establishing fine-grained, pedagogy-oriented evaluation dimensions; (2) an end-to-end software tool supporting model behavior diagnostics, pedagogical strategy attribution, and data-driven optimization; and (3) significantly improved transparency, auditability, and practical adaptability of educational AI systems, already deployed for educators and the *ACL community.
📝 Abstract
We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors. It provides software for demonstration and evaluation, as well as for model inspection and data visualization. The tool is aimed at education stakeholders and the *ACL community at large: it supports learning and can also be used to collect user feedback and annotations.