AI Summary
Current large language models show limited ability to predict a human's next utterance from multimodal cues such as gestures, gaze, and affective signals. To address this gap, this work introduces SayNext-PC, a large-scale multimodal conversational dataset, together with SayNext-Bench, a comprehensive evaluation benchmark. It further proposes SayNext-Chat, a cognitively inspired dual-path multimodal large language model designed to emulate the predictive processing mechanisms underlying human dialogue. Experiments show that SayNext-Chat outperforms existing approaches on lexical overlap, semantic similarity, and emotional consistency metrics. The study provides the first systematic validation of the critical role multimodal cues play in naturalistic conversation prediction, offering a new paradigm for embodied interactive intelligence.
Abstract
We explore the use of large language models (LLMs) for next-utterance prediction in human dialogue. Although recent advances show that LLMs can engage in natural conversations with users, we find that even leading models surprisingly struggle to predict a human speaker's next utterance. In contrast, humans readily anticipate forthcoming utterances from multimodal cues in context, such as gestures, gaze, and emotional tone. To systematically examine whether LLMs can reproduce this ability, we propose SayNext-Bench, a benchmark that evaluates LLMs and multimodal LLMs (MLLMs) on anticipating context-conditioned responses from multimodal cues across a variety of real-world scenarios. To support this benchmark, we build SayNext-PC, a novel large-scale dataset of dialogues with rich multimodal cues. Building on this, we further develop a dual-route prediction MLLM, SayNext-Chat, whose cognitively inspired design emulates predictive processing in conversation. Experimental results show that our model outperforms state-of-the-art MLLMs in lexical overlap, semantic similarity, and emotion consistency. Our results demonstrate the feasibility of next-utterance prediction with LLMs from multimodal cues and highlight (i) the indispensable role of multimodal cues and (ii) active predictive processing as foundations of natural human interaction, both of which are missing in current MLLMs. We hope this exploration offers a new entry point toward more human-like, context-sensitive interaction for human-centered AI. Our benchmark and model can be accessed at https://saynext.github.io/.
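The abstract names three evaluation axes (lexical overlap, semantic similarity, emotion consistency) without specifying the metrics. As a rough illustration only, the sketch below computes two simplified stand-ins: a token-level F1 for lexical overlap (real evaluations typically use BLEU or ROUGE) and a bag-of-words cosine as a crude proxy for embedding-based semantic similarity; the function names and the toy utterances are invented for this example and are not from SayNext-Bench.

```python
import math
from collections import Counter

def lexical_overlap_f1(pred: str, ref: str) -> float:
    """Token-level F1 between a predicted and a reference utterance.
    A simple stand-in for lexical-overlap metrics such as BLEU/ROUGE."""
    p, r = pred.lower().split(), ref.lower().split()
    overlap = sum((Counter(p) & Counter(r)).values())  # shared token count
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(r)
    return 2 * prec * rec / (prec + rec)

def bow_cosine(pred: str, ref: str) -> float:
    """Bag-of-words cosine similarity.
    A crude proxy; a real benchmark would compare sentence embeddings."""
    p, r = Counter(pred.lower().split()), Counter(ref.lower().split())
    dot = sum(p[t] * r[t] for t in p)
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in r.values()))
    return dot / norm if norm else 0.0

# Toy example: a predicted next utterance vs. what the speaker said.
pred = "I think we should leave now"
ref = "we should probably leave now"
print(round(lexical_overlap_f1(pred, ref), 2))  # 0.73
print(round(bow_cosine(pred, ref), 2))          # 0.73
```

Emotion consistency would additionally require an emotion classifier applied to both utterances, which is outside the scope of this sketch.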