🤖 AI Summary
Large language models (LLMs) exhibit significant limitations in generating age-appropriate dialogue for children in low-resource languages. Method: This study presents the first systematic evaluation of GPT-4, RUTER-LLAMA-2-13b, GPTSW, NorMistral-7b, and NorBloom-7b on Norwegian dialogue generation tailored to 5- and 9-year-old children. The evaluation combines authentic child language corpora with blinded assessments by education specialists (inter-rater reliability ICC = 0.75), focusing on developmental appropriateness and linguistic authenticity. Results: All models display pronounced adult-oriented biases and frequently produce language inappropriate for the target ages; GPT-4 and NorBloom-7b perform best, and evaluators identified the intended age more accurately for the 5-year-old cohort than for the 9-year-old cohort. This work establishes the first empirical benchmark for child-directed dialogue generation in a low-resource language, revealing critical challenges such as age bias in training data and the scarcity of child-aligned corpora, and provides actionable insights for developing pedagogically grounded LLM adaptations for young users.
📝 Abstract
Large Language Models (LLMs), predominantly trained on adult conversational data, face significant challenges in generating authentic, child-like dialogue for specialized applications. We present a comparative study evaluating the ability of five LLMs (GPT-4, RUTER-LLAMA-2-13b, GPTSW, NorMistral-7b, and NorBloom-7b) to generate age-appropriate Norwegian conversations for children aged 5 and 9 years. Through a blind evaluation by eleven education professionals using both real child interview data and LLM-generated text samples, we assessed authenticity and developmental appropriateness. Our results show that evaluators achieved strong inter-rater reliability (ICC = 0.75) and predicted age more accurately for the younger children (5-year-olds) than for the older children (9-year-olds). While GPT-4 and NorBloom-7b performed relatively well, most models generated language perceived as more linguistically advanced than appropriate for the target age groups. These findings highlight critical data-related challenges in developing LLM systems for specialized applications involving children, particularly in low-resource languages where comprehensive age-appropriate lexical resources are scarce.
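The abstract reports agreement among the eleven evaluators as an intraclass correlation coefficient (ICC = 0.75) but does not specify the ICC variant or rating scale. The snippet below is a minimal sketch of how such an inter-rater reliability figure can be computed with the `pingouin` library; the evaluator names, sample IDs, and age guesses are purely hypothetical and are not the paper's data.

```python
import pandas as pd
import pingouin as pg

# Hypothetical long-format ratings: three evaluators (the paper used eleven)
# each guess the speaker's age for six dialogue samples.
ratings = pd.DataFrame({
    "sample":    [s for s in range(1, 7) for _ in range(3)],  # 1,1,1,2,2,2,...
    "evaluator": ["E1", "E2", "E3"] * 6,
    "age_guess": [5, 5, 6,  9, 8, 9,  5, 6, 5,
                  8, 9, 9,  5, 5, 5,  9, 9, 8],
})

# Computes all standard ICC variants (ICC1, ICC2, ICC3 and their averaged
# forms); the paper's reported 0.75 would correspond to one such variant.
icc = pg.intraclass_corr(
    data=ratings, targets="sample", raters="evaluator", ratings="age_guess"
)
print(icc[["Type", "Description", "ICC"]])
```

An ICC of 0.75 is conventionally read as the lower edge of "good" reliability, which is consistent with the abstract's characterization of agreement as strong.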