🤖 AI Summary
Existing models rely on future context, which hinders real-time human-machine interaction. This paper introduces Online-MMSI, an online setting for multimodal social interaction understanding in which models must infer solely from historical dialogue and streaming video. The proposed framework, Online-MMSI-VLM, combines two complementary strategies on top of a multimodal large language model: (1) multi-party conversation forecasting, which anticipates upcoming speaker turns and then generates fine-grained future utterances in a coarse-to-fine manner; and (2) social-aware visual prompting, which highlights interpersonal dynamics by overlaying per-person bounding boxes and body keypoints on each frame. Evaluated on three tasks across two benchmark datasets, the method significantly outperforms baseline models and achieves state-of-the-art performance. Code and pretrained models will be released publicly.
📝 Abstract
Multimodal social interaction understanding (MMSI) is critical in human-robot interaction systems. In real-world scenarios, AI agents are required to provide real-time feedback. However, existing models often depend on both past and future contexts, which prevents their application to such real-world problems. To bridge this gap, we propose an online MMSI setting, where the model must resolve MMSI tasks using only historical information, such as recorded dialogues and video streams. To address the challenge of missing useful future context, we develop a novel framework, named Online-MMSI-VLM, that leverages two complementary strategies with multi-modal large language models: multi-party conversation forecasting and social-aware visual prompting. First, to enrich linguistic context, multi-party conversation forecasting simulates potential future utterances in a coarse-to-fine manner, anticipating upcoming speaker turns and then generating fine-grained conversational details. Second, to effectively incorporate visual social cues such as gaze and gesture, social-aware visual prompting highlights the social dynamics in video with bounding boxes and body keypoints for each person in each frame. Extensive experiments on three tasks and two datasets demonstrate that our method achieves state-of-the-art performance and significantly outperforms baseline models, indicating its effectiveness on Online-MMSI. The code and pre-trained models will be publicly released at: https://github.com/Sampson-Lee/OnlineMMSI.
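To make the social-aware visual prompting idea concrete, here is a minimal sketch of overlaying a person's bounding box and body keypoints onto a video frame. It assumes the boxes and keypoints come from an off-the-shelf detector and pose estimator; the function name `draw_social_prompts` and the plain-NumPy drawing are illustrative, not taken from the paper's released code (which would typically use a library such as OpenCV for rendering).

```python
import numpy as np

def draw_social_prompts(frame, bbox, keypoints, color=255):
    """Overlay one person's bounding box and body keypoints on a frame.

    frame:     H x W uint8 grayscale image (a color image works the same
               way per channel); a modified copy is returned.
    bbox:      (x1, y1, x2, y2) pixel coordinates of the person.
    keypoints: iterable of (x, y) pixel coordinates, e.g. from a pose
               estimator (hypothetical upstream component).
    """
    out = frame.copy()
    x1, y1, x2, y2 = bbox
    # Draw the box outline as a 1-pixel border.
    out[y1, x1:x2 + 1] = color
    out[y2, x1:x2 + 1] = color
    out[y1:y2 + 1, x1] = color
    out[y1:y2 + 1, x2] = color
    # Mark each keypoint with a small 3x3 dot.
    for kx, ky in keypoints:
        out[max(ky - 1, 0):ky + 2, max(kx - 1, 0):kx + 2] = color
    return out

# Toy 64x64 frame with one "person": a box and two keypoints.
frame = np.zeros((64, 64), dtype=np.uint8)
prompted = draw_social_prompts(frame, bbox=(10, 10, 40, 50),
                               keypoints=[(20, 20), (30, 35)])
```

The prompted frames (one overlay per person, per frame) are what the multimodal LLM sees in place of the raw video, making each participant's location and posture explicit in the visual input.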