🤖 AI Summary
Current LLM-based chatbots generate coherent textual responses but struggle to determine the appropriate timing and type of short, naturalistic conversational reactions, such as "um," head nods, or smiles, because they rely exclusively on text and miss the multimodal rhythmic cues (e.g., audio prosody, visual turn-taking signals) present in real human dialogue. To address this, we propose MM-When2Speak, the first model designed to jointly predict response timing and response type via multimodal temporal alignment. It introduces a novel, manually annotated multimodal video dataset of natural dialogues with precise frame-level timing labels. The architecture integrates audio, visual, and textual modalities within a multimodal large language model framework, employing fine-grained temporal alignment and cross-modal dynamic fusion. Experiments demonstrate that MM-When2Speak achieves a fourfold improvement in response timing accuracy over state-of-the-art commercial LLMs and significantly outperforms unimodal and text-only baselines on both short-response classification and temporal localization tasks.
📝 Abstract
While large language model (LLM)-based chatbots have demonstrated strong capabilities in generating coherent and contextually relevant responses, they often struggle with understanding when to speak, particularly in delivering brief, timely reactions during ongoing conversations. This limitation arises largely from their reliance on text input alone, which lacks the rich contextual cues of real-world human dialogue. In this work, we focus on real-time prediction of response types, with an emphasis on short, reactive utterances that depend on subtle multimodal signals across vision, audio, and text. To support this, we introduce a new multimodal dataset constructed from real-world conversational videos, containing temporally aligned visual, auditory, and textual streams. This dataset enables fine-grained modeling of response timing in dyadic interactions. Building on this dataset, we propose MM-When2Speak, a multimodal LLM-based model that adaptively integrates visual, auditory, and textual context to predict when a response should occur and what type of response is appropriate. Experiments show that MM-When2Speak significantly outperforms state-of-the-art unimodal and LLM-based baselines, achieving up to a 4x improvement in response timing accuracy over leading commercial LLMs. These results underscore the importance of multimodal inputs for producing timely, natural, and engaging conversational AI.
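To make the "cross-modal dynamic fusion plus joint timing/type prediction" idea concrete, here is a minimal toy sketch. It is not the paper's actual architecture: the projection matrices are random stand-ins for learned weights, the frame count, embedding size, and the four response-type classes (none, backchannel like "um," nonverbal nod/smile, full reply) are invented for illustration, and the real model would sit inside a multimodal LLM rather than plain numpy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: T aligned frames, d-dimensional per-frame embeddings per modality.
T, d = 8, 16
audio = rng.normal(size=(T, d))    # stand-in for audio (prosody) features
visual = rng.normal(size=(T, d))   # stand-in for visual (gesture/gaze) features
text = rng.normal(size=(T, d))     # stand-in for textual context features

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# "Dynamic" fusion sketch: a per-frame gate decides how much each modality
# contributes at that instant. W_gate would be learned; here it is random.
W_gate = rng.normal(size=(3 * d, 3))
stacked = np.concatenate([audio, visual, text], axis=-1)   # (T, 3d)
gates = softmax(stacked @ W_gate, axis=-1)                 # (T, 3), rows sum to 1
fused = (gates[:, 0:1] * audio
         + gates[:, 1:2] * visual
         + gates[:, 2:3] * text)                           # (T, d)

# Joint heads over the fused stream: WHEN (per-frame respond probability)
# and WHAT (distribution over 4 hypothetical short-response types).
W_when = rng.normal(size=(d, 1))
W_type = rng.normal(size=(d, 4))
p_respond = 1.0 / (1.0 + np.exp(-(fused @ W_when)))        # (T, 1) sigmoid
p_type = softmax(fused @ W_type, axis=-1)                  # (T, 4)

frame = int(p_respond.argmax())
print("respond at frame", frame, "with type", int(p_type[frame].argmax()))
```

The gating step is one simple reading of "cross-modal dynamic fusion": the mixture over modalities changes frame by frame, so a visual turn-taking cue can dominate at one instant and prosody at another, while the two output heads make timing and response type a joint prediction over the same fused representation.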