🤖 AI Summary
This study addresses the challenge of scaling and sustaining high-quality monitoring of teacher–child interactions within China's vast preschool system, which serves 36 million children and where traditional expert evaluations are impractical for routine use. To this end, the authors construct TEPE-TCI-370h, the first large-scale Chinese-language dataset of kindergarten teacher–child interactions, and introduce Interaction2Eval, a framework tailored for early childhood education that combines large language models with child speech recognition, Mandarin homophone disambiguation, and rubric-based reasoning to automatically extract structured quality indicators from naturalistic classroom interactions. Empirical validation across 43 classrooms demonstrates an 18-fold increase in assessment efficiency and up to 88% agreement with expert judgments, enabling a shift from annual expert audits to low-cost, monthly AI-assisted quality monitoring.
📝 Abstract
High-quality teacher-child interaction (TCI) is fundamental to early childhood development, yet traditional expert-based assessment faces a critical scalability challenge. In large systems such as China's, which serves 36 million children across more than 250,000 kindergartens, the cost and time requirements of manual observation make continuous quality monitoring infeasible, relegating assessment to infrequent, episodic audits that limit timely intervention and improvement tracking.
In this paper, we investigate whether AI can serve as a scalable assessment teammate by extracting structured quality indicators and validating their alignment with human expert judgments.
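To make the validation step concrete, the sketch below computes simple percent agreement between AI-assigned and expert-assigned rubric scores. This is an illustration only: the paper does not specify its exact agreement metric at this level, so the 1-7 scale, the adjacent-agreement variant, and all data here are assumptions.

```python
# Hedged sketch (not the authors' code): percent agreement between
# AI-extracted and expert rubric scores. Scale and tolerance are assumed.
def percent_agreement(ai_scores, expert_scores, tolerance=0):
    """Fraction of items where the two raters differ by <= tolerance."""
    assert len(ai_scores) == len(expert_scores)
    hits = sum(abs(a - e) <= tolerance
               for a, e in zip(ai_scores, expert_scores))
    return hits / len(ai_scores)

# Hypothetical 1-7 rubric ratings for six observed classroom segments.
ai_scores     = [5, 4, 6, 3, 5, 4]
expert_scores = [5, 5, 6, 3, 4, 4]

print(f"exact agreement:    {percent_agreement(ai_scores, expert_scores):.0%}")
print(f"adjacent agreement: "
      f"{percent_agreement(ai_scores, expert_scores, tolerance=1):.0%}")
```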
Our contributions are threefold: (1) we present TEPE-TCI-370h (Tracing Effective Preschool Education), the first large-scale dataset of naturalistic teacher-child interactions in Chinese preschools (370 hours, 105 classrooms), annotated with the standardized ECQRS-EC and SSTEW instruments; (2) we develop Interaction2Eval, a specialized LLM-based framework that addresses domain-specific challenges (child speech recognition, Mandarin homophone disambiguation, and rubric-based reasoning) and achieves up to 88% agreement with expert judgments; and (3) we validate deployment across 43 classrooms, demonstrating an 18x efficiency gain in the assessment workflow and highlighting the potential to shift from annual expert audits to monthly AI-assisted monitoring with targeted human oversight. This work not only demonstrates the technical feasibility of scalable, AI-augmented quality assessment but also lays the foundation for a new paradigm in early childhood education, one where continuous, inclusive, AI-assisted evaluation becomes the engine of systemic improvement and equitable growth.
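For readers who want a sense of how such a framework might be organized, here is a minimal skeleton of an Interaction2Eval-style pipeline, inferred from the three challenges named in the abstract. Every function name, type, and prompt below is a placeholder of ours, not the authors' implementation, whose interfaces are not described at this level of detail.

```python
# Illustrative skeleton only: a three-stage pipeline matching the challenges
# named in the abstract (child ASR -> Mandarin homophone disambiguation ->
# rubric-based LLM scoring). All names and the prompt are assumptions.
from dataclasses import dataclass

@dataclass
class Indicator:
    rubric_item: str  # e.g. an SSTEW sub-scale name
    score: int        # rubric level assigned by the model
    evidence: str     # transcript excerpt cited as support

def transcribe(audio_path: str) -> str:
    """Stage 1: child-aware speech recognition (placeholder)."""
    raise NotImplementedError

def disambiguate_homophones(transcript: str) -> str:
    """Stage 2: correct Mandarin homophone errors using classroom context
    (placeholder)."""
    raise NotImplementedError

def score_with_rubric(transcript: str, rubric_item: str) -> Indicator:
    """Stage 3: prompt an LLM to reason over one rubric item and return a
    structured score with supporting evidence."""
    prompt = (
        "You are assessing a kindergarten teacher-child interaction.\n"
        f"Rubric item: {rubric_item}\n"
        f"Transcript:\n{transcript}\n"
        "Assign a score on the rubric's scale and quote the evidence."
    )
    # A real implementation would send `prompt` to an LLM and parse a
    # structured response; that call is omitted here.
    raise NotImplementedError

def evaluate_session(audio_path: str, rubric_items: list[str]) -> list[Indicator]:
    """Run one recorded session through all three stages."""
    text = disambiguate_homophones(transcribe(audio_path))
    return [score_with_rubric(text, item) for item in rubric_items]
```

The per-item scores returned by `evaluate_session` would then feed an agreement check against expert ratings like the one sketched above.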