🤖 AI Summary
This study addresses a critical limitation in existing child language assessment metrics, such as mean length of utterance, which disregard conversational context and fail to capture key developmental dimensions like inferential depth, topic maintenance, and discourse progression. To overcome this, the authors propose a context-aware evaluation framework built on large language models. The approach first classifies the type of the preceding adult utterance and then scores the child's response along two axes: Expansion (reflecting contextual inference and elaboration) and Independence (indicating autonomous discourse advancement). For the first time, Expansion and Independence are operationalized as developmentally sensitive indicators, moving beyond the traditional reliance on utterance length. Empirical results show that the proposed metrics significantly outperform baseline measures in age prediction, effectively distinguish semantic differences across conversational contexts, and align closely with human judgments.
📝 Abstract
Evaluating the quality of children's utterances in adult-child dialogue remains challenging due to insufficient context-sensitive metrics. Common proxies such as Mean Length of Utterance (MLU), lexical diversity (vocd-D), and readability indices (Flesch-Kincaid Grade Level, Gunning Fog Index) are dominated by length and ignore conversational context, missing aspects of response quality such as reasoning depth, topic maintenance, and discourse planning. We introduce an LLM-as-a-judge framework that first classifies the Previous Adult Utterance Type and then scores the child's response along two axes: Expansion (contextual elaboration and inferential depth) and Independence (the child's contribution to advancing the discourse). These axes reflect fundamental dimensions of child language development: Expansion captures elaboration, clause combining, and causal and contrastive connectives; Independence captures initiative, topic control, decreasing reliance on adult scaffolding through growing self-regulation, and audience design. We establish developmental validity by showing age-related patterns and demonstrate predictive value by improving age estimation over common baselines. We further confirm semantic sensitivity by detecting differences tied to discourse relations. Our metrics align with human judgments, enabling large-scale evaluation. This shifts child utterance assessment from simply measuring length to evaluating how meaningfully the child's speech contributes to and advances the conversation within its context.
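The two-stage pipeline the abstract describes can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the utterance-type taxonomy, prompt wording, 1–5 score range, and the pluggable `llm` callable are all assumptions made for the example.

```python
import json
from typing import Callable

# Illustrative utterance-type labels; the paper's actual taxonomy may differ.
UTTERANCE_TYPES = ["wh-question", "yes/no question", "statement", "directive"]

CLASSIFY_PROMPT = (
    "Classify the adult utterance into one of {types}.\n"
    "Adult: {adult}\n"
    "Answer with the type only."
)

SCORE_PROMPT = (
    "The adult said ({utype}): {adult}\n"
    "The child replied: {child}\n"
    "Rate the child's reply from 1 to 5 on two axes and answer as JSON "
    '{{"expansion": <int>, "independence": <int>}}.\n'
    "Expansion: contextual elaboration and inferential depth.\n"
    "Independence: how much the reply advances the discourse on its own."
)

def judge_exchange(adult: str, child: str,
                   llm: Callable[[str], str]) -> dict:
    """Stage 1: classify the preceding adult utterance.
    Stage 2: score the child's response on Expansion and Independence."""
    utype = llm(CLASSIFY_PROMPT.format(types=UTTERANCE_TYPES,
                                       adult=adult)).strip()
    raw = llm(SCORE_PROMPT.format(utype=utype, adult=adult, child=child))
    scores = json.loads(raw)  # expects {"expansion": int, "independence": int}
    return {"adult_type": utype, **scores}
```

Any chat-completion client can be wrapped as the `llm` callable (a function from prompt string to response string), so the judging logic stays independent of the model provider.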