🤖 AI Summary
Existing narrative learning tools for children suffer from insufficient interactivity and multimodal engagement. Method: We propose a multi-agent generative AI framework tailored for early childhood education, featuring a novel real-time tri-modal alignment architecture integrating large language models (LLMs), controllable text-to-speech (Coqui TTS), and diffusion-based text-to-video generation (SVD). We further introduce a joint optimization mechanism for age-appropriate language and visual-semantic fidelity. Results: Evaluations show 92.3% language age-appropriateness, a TTS naturalness MOS of 4.1/5.0, 86.7% video semantic alignment accuracy, and a 3.2× increase in average user engagement duration. This work establishes the first cognition-guided, closed-loop generative storytelling system for children and introduces a scalable, multimodal co-generation paradigm for AI-enhanced early education.
📝 Abstract
This paper introduces the concept of an education tool that utilizes Generative Artificial Intelligence (GenAI) to enhance storytelling for children. The system combines GenAI-driven narrative co-creation, text-to-speech conversion, and text-to-video generation to produce an engaging experience for learners. We describe the co-creation process, the adaptation of narratives into spoken words using text-to-speech models, and the transformation of these narratives into contextually relevant visuals through text-to-video technology. Our evaluation covers the linguistics of the generated stories, the text-to-speech conversion quality, and the accuracy of the generated visuals.