🤖 AI Summary
This study addresses the real-time prediction of listeners' comprehension states (understanding, partial understanding, non-understanding, and misunderstanding) in explanatory dialogues. The authors propose a cognitive modeling approach that integrates linguistic and non-linguistic cues, combining three types of cognitive-load-related features: information value, syntactic complexity, and interactive gaze behavior. Using the MUNDEX corpus, they apply both statistical analysis and machine learning methods, including off-the-shelf classifiers and a fine-tuned German BERT-based multimodal model. Experimental results show that all three feature categories are significantly associated with comprehension states, and that their integration consistently improves prediction accuracy on the four-state classification task. These findings support the use of multidimensional cognitive-load indicators for fine-grained modeling of listener understanding in dialogue.
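The summary does not spell out the model architecture, but the kind of fusion it describes (a German BERT text representation concatenated with the three numeric cue features before a four-way classification head) might look like the minimal sketch below. The model name `bert-base-german-cased`, the layer sizes, and the cue values are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultimodalComprehensionClassifier(nn.Module):
    """Fuses a BERT [CLS] embedding with numeric cognitive-load cues."""

    def __init__(self, bert_name="bert-base-german-cased", n_cues=3, n_states=4):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        # Concatenate the [CLS] vector with the cue vector
        # (surprisal, syntactic complexity, gaze variation).
        self.head = nn.Sequential(
            nn.Linear(hidden + n_cues, 128),
            nn.ReLU(),
            nn.Linear(128, n_states),  # Understanding / Partial / Non- / Mis-
        )

    def forward(self, input_ids, attention_mask, cues):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token representation
        return self.head(torch.cat([cls, cues], dim=-1))

tok = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = MultimodalComprehensionClassifier()
batch = tok(["Dann ziehst du eine Karte."], return_tensors="pt")
cues = torch.tensor([[5.2, 0.7, 0.3]])  # hypothetical cue values for one utterance
logits = model(batch["input_ids"], batch["attention_mask"], cues)  # shape: (1, 4)
```

Late fusion of this kind keeps the cue features interpretable and lets the same pipeline run with or without them, which is how an ablation comparing "text only" against "text plus cues" would typically be set up.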
📝 Abstract
We investigate how verbal and nonverbal linguistic features, exhibited by speakers and listeners in dialogue, can contribute to predicting the listener's state of understanding in explanatory interactions on a moment-by-moment basis. Specifically, we examine three linguistic cues related to cognitive load and hypothesised to correlate with listener understanding: the information value (operationalised with surprisal) and syntactic complexity of the speaker's utterances, and the variation in the listener's interactive gaze behaviour. Based on statistical analyses of the MUNDEX corpus of face-to-face dialogic board game explanations, we find that individual cues vary with the listener's level of understanding. Listener states ('Understanding', 'Partial Understanding', 'Non-Understanding' and 'Misunderstanding') were self-annotated by the listeners using a retrospective video-recall method. The results of a subsequent classification experiment, involving two off-the-shelf classifiers and a fine-tuned German BERT-based multimodal classifier, demonstrate that prediction of these four states of understanding is generally possible and improves when the three linguistic cues are considered alongside textual features.
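To make the surprisal cue concrete: surprisal is the negative log-probability of a word given its left context, surprisal(w) = −log₂ P(w | context), so less predictable words carry higher information value. A minimal sketch of computing per-token surprisal with an off-the-shelf German causal language model follows; the model choice `dbmdz/german-gpt2` is an assumption for illustration, not necessarily the model used in the paper.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "dbmdz/german-gpt2"  # illustrative choice of German LM
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def token_surprisals(text):
    """Return (token, surprisal-in-bits) pairs; higher = less predictable."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability of each token given its left context
    # (the logits at position i predict token i+1, hence the shift by one).
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    token_lp = log_probs[torch.arange(targets.size(0)), targets]
    bits = (-token_lp / math.log(2)).tolist()  # convert nats to bits
    return list(zip(tokenizer.convert_ids_to_tokens(targets.tolist()), bits))

for token, s in token_surprisals("Das Spiel beginnt mit dem Würfeln."):
    print(f"{token:>12}  {s:6.2f} bits")
```

Averaging or pooling these per-token values over an utterance yields a single information-value feature per speaker turn, which is the granularity at which a cue like this can be aligned with the listener's self-annotated states.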