🤖 AI Summary
This study systematically reviews research, published from 2010 to 2025, on AI-driven Intelligent Tutoring Systems (ITS) deployed in authentic educational settings. It addresses critical methodological gaps in that literature, including weak experimental designs, insufficiently rigorous data analysis, and ambiguous causal attribution of learning outcomes. To tackle these issues, we conduct a cross-disciplinary systematic literature review integrating perspectives from educational technology, natural language processing, adaptive learning, and student modeling. Our analysis uncovers structural bottlenecks in ITS effectiveness, particularly regarding pedagogical strategy adaptation, domain knowledge transfer, and longitudinal learning impact validation. We introduce, for the first time, a “Three-Dimensional Framework for Evaluation Rigor” (comprising ecological validity, causal inference, and multi-level evidence chains) and derive actionable improvement pathways and development guidelines grounded in this framework. The findings advance both theoretical understanding and methodological practice, supporting the transition of educational AI from technological demonstration to evidence-informed implementation.
📝 Abstract
AI-based Intelligent Tutoring Systems (ITS) have significant potential to transform teaching and learning. As efforts continue to design, develop, and integrate ITS into educational contexts, evidence of their effectiveness has been mixed. This paper provides a comprehensive review to understand how ITS operate in real educational settings and to identify the challenges associated with their application and evaluation. We use a systematic literature review method to analyze eligible studies published from 2010 to 2025, covering areas such as pedagogical strategies, natural language processing (NLP), adaptive learning, student modeling, and domain-specific applications of ITS. The results reveal a complex landscape regarding the effectiveness of ITS, highlighting both advancements and persistent challenges. The study also identifies a need for greater scientific rigor in experimental design and data analysis. Based on these findings, we propose directions for future research and discuss practical implications.