Evaluating LLMs on Generating Age-Appropriate Child-Like Conversations

📅 2025-10-28

📈 Citations: 0

✨ Influential: 0

career value

202K/year

🤖 AI Summary

Large language models (LLMs) exhibit significant limitations in generating age-appropriate dialogues for children in low-resource languages. Method: This study presents the first systematic evaluation of GPT-4, RUTER-LLAMA-2-13b, GPTSW, NorMistral-7b, and NorBloom-7b on Norwegian dialogue generation tailored to 5- and 9-year-old children. Evaluation combines authentic child language corpora with blinded expert assessments by education specialists (ICC = 0.75), focusing on developmental appropriateness and linguistic authenticity. Results: All models display pronounced adult-oriented biases and substantial age-inappropriateness; GPT-4 and NorBloom-7b achieve the highest performance, with notably better accuracy for the 5-year-old cohort. This work establishes the first empirical benchmark for child-directed dialogue generation in a low-resource language, revealing critical challenges—including age bias in training data and scarcity of child-aligned corpora—and provides actionable insights for developing pedagogically grounded LLM adaptations for young users.

Technology Category

Application Category

📝 Abstract

Large Language Models (LLMs), predominantly trained on adult conversational data, face significant challenges when generating authentic, child-like dialogue for specialized applications. We present a comparative study evaluating five different LLMs (GPT-4, RUTER-LLAMA-2-13b, GPTSW, NorMistral-7b, and NorBloom-7b) to generate age-appropriate Norwegian conversations for children aged 5 and 9 years. Through a blind evaluation by eleven education professionals using both real child interview data and LLM-generated text samples, we assessed authenticity and developmental appropriateness. Our results show that evaluators achieved strong inter-rater reliability (ICC=0.75) and demonstrated higher accuracy in age prediction for younger children (5-year-olds) compared to older children (9-year-olds). While GPT-4 and NorBloom-7b performed relatively well, most models generated language perceived as more linguistically advanced than the target age groups. These findings highlight critical data-related challenges in developing LLM systems for specialized applications involving children, particularly in low-resource languages where comprehensive age-appropriate lexical resources are scarce.

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to generate authentic child-like conversations

Assessing age-appropriate Norwegian dialogue for children applications

Addressing data scarcity challenges for specialized child-focused LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated five LLMs for child dialogue generation

Used blind professional evaluation with real child data

Assessed age-appropriate Norwegian conversations for children

🔎 Similar Papers

No similar papers found.