🤖 AI Summary
This study addresses the challenging problem of modeling infant language acquisition by formalizing lexical acquisition as a spatiotemporal graph learning task, the first such formulation. We propose a developmentally informed, relation-weighted Spatiotemporal Graph Convolutional Network (STGCN) that jointly models four multimodal inter-word relations (semantic, sensorimotor, visual, and auditory) while capturing the dynamic evolution of lexical mastery states. Our contributions are threefold: (1) constructing the first multimodal, temporally explicit graph structure for infant vocabulary acquisition; (2) introducing a relation-specific weighting mechanism to quantify the differential predictive contributions of distinct linguistic relations; and (3) empirically validating effectiveness on real developmental data, achieving prediction accuracies of 0.733 (sensorimotor) and 0.729 (semantic) that significantly outperform baselines; visual relations yield the highest recall, enhancing coverage of potential target words.
📝 Abstract
Predicting which words a child will learn next can be useful for supporting language acquisition, and such predictions have been shown to be possible both with neural network techniques (modeling changes in the vocabulary state over time) and with graph models (exploiting data on the relationships between words). However, neither approach, used in isolation, fully captures the complexity of an infant's language learning process. In this paper, we examine how a model of language acquisition for infants and young children can be constructed and adapted for use in a Spatio-Temporal Graph Convolutional Network (STGCN), taking into account the different types of linguistic relationship that arise during child language learning. We introduce a novel approach to predicting child vocabulary acquisition and evaluate its efficacy with respect to these relationship types, yielding insightful observations on model calibration and norm selection. In our evaluation, the mean accuracy of models predicting new words from sensorimotor relationships (0.733) and semantic relationships (0.729) was superior to that of a 2-layer feed-forward neural network. Furthermore, the high recall observed for some relationships suggested that certain relationship types (e.g. visual) identify a larger proportion of the relevant words a child will subsequently learn than others (such as auditory).
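To make the core idea concrete, the sketch below shows, under stated assumptions, what a relation-weighted spatiotemporal graph convolution over a word graph might look like. This is not the authors' implementation: the relation weights, adjacency matrices, feature sizes, and the simple temporal smoothing step are all illustrative stand-ins for the learned components of a full STGCN.

```python
# Hedged sketch (NOT the paper's code) of a relation-weighted spatiotemporal
# graph convolution over an inter-word graph, using only NumPy.
import numpy as np

rng = np.random.default_rng(0)
n_words, n_feats, T = 6, 4, 3  # toy sizes, purely illustrative

# One adjacency matrix per inter-word relation type, as in the paper's
# four multimodal relations; here they are random toy graphs.
relations = ["semantic", "sensorimotor", "visual", "auditory"]
adj = {r: (rng.random((n_words, n_words)) > 0.6).astype(float)
       for r in relations}

def normalize(a):
    """Symmetric GCN normalization with self-loops: D^-1/2 (A+I) D^-1/2."""
    a = a + np.eye(len(a))
    d = np.power(a.sum(axis=1), -0.5)
    return a * d[:, None] * d[None, :]

# Scalar weight per relation (learned in the real model, fixed here),
# mirroring the relation-specific weighting mechanism described above.
rel_w = {"semantic": 0.4, "sensorimotor": 0.3, "visual": 0.2, "auditory": 0.1}
W = rng.standard_normal((n_feats, n_feats)) * 0.1  # shared feature transform

def spatial_conv(x):
    """Weighted sum of per-relation graph convolutions: sum_r w_r * A_r x W."""
    return sum(rel_w[r] * normalize(adj[r]) @ x @ W for r in relations)

# Temporal part: apply the spatial convolution at each time step, then take
# an exponential moving average over time as a crude stand-in for the
# temporal convolution/gating used in full STGCNs.
x_t = rng.standard_normal((T, n_words, n_feats))  # word features per step
h, alpha = np.zeros((n_words, n_feats)), 0.5
for t in range(T):
    h = alpha * np.tanh(spatial_conv(x_t[t])) + (1 - alpha) * h

# h now holds a temporally smoothed, relation-weighted embedding per word;
# a classifier on h would predict each word's mastery state.
print(h.shape)  # (6, 4)
```

A trained model would learn `rel_w` and `W` end-to-end, so that the fitted relation weights directly quantify how much each relation type (e.g. sensorimotor vs. auditory) contributes to predicting the next acquired words.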