Omni-MMSI: Toward Identity-attributed Social Interaction Understanding

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current AI systems struggle to accurately interpret socially grounded, identity-aware interactions—such as "who is speaking to whom"—from raw audiovisual inputs, limiting their reliability in real-world scenarios. This work introduces Omni-MMSI, the first end-to-end task formulation designed to jointly perceive identity cues and reason about social interaction structures directly from multimodal inputs. To address this challenge, the authors propose Omni-MMSI-R, a reference-guided framework built on a multimodal large language model that uses participant-level reference pairs, an identity-attribution toolchain, and chain-of-thought social reasoning, removing the conventional reliance on pre-extracted (oracle) cues. Experimental results demonstrate that Omni-MMSI-R substantially outperforms existing baselines, showing markedly stronger social interaction understanding in realistic settings.
📝 Abstract
We introduce Omni-MMSI, a new task that requires comprehensive social interaction understanding from raw audio, vision, and speech input. The task involves perceiving identity-attributed social cues (e.g., who is speaking what) and reasoning about the social interaction (e.g., whom the speaker refers to). This task is essential for developing AI assistants that can perceive and respond to human interactions. Unlike prior studies that operate on oracle-preprocessed social cues, Omni-MMSI reflects realistic scenarios where AI assistants must perceive and reason from raw data. However, existing pipelines and multi-modal LLMs perform poorly on Omni-MMSI because they lack reliable identity attribution capabilities, which leads to inaccurate social interaction understanding. To address this challenge, we propose Omni-MMSI-R, a reference-guided pipeline that produces identity-attributed social cues with tools and conducts chain-of-thought social reasoning. To facilitate this pipeline, we construct participant-level reference pairs and curate reasoning annotations on top of the existing datasets. Experiments demonstrate that Omni-MMSI-R outperforms advanced LLMs and counterparts on Omni-MMSI. Project page: https://sampson-lee.github.io/omni-mmsi-project-page.
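To make the abstract's notion of an "identity-attributed social cue" concrete, here is a minimal, hypothetical sketch of how such cues could be represented as data — an utterance tied to a resolved speaker identity ("who is speaking what") plus an inferred addressee ("whom the speaker refers to"). All class and field names are our own illustration, not the paper's actual data format:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class Utterance:
    """One identity-attributed social cue."""
    speaker_id: str                      # resolved identity of the speaker
    text: str                            # transcribed speech content
    addressee_id: Optional[str] = None   # inferred target of the utterance

@dataclass
class SocialInteraction:
    """A scene-level record of who said what to whom."""
    participants: List[str]
    utterances: List[Utterance] = field(default_factory=list)

    def addressees_of(self, speaker_id: str) -> Set[str]:
        """Collect everyone a given speaker addressed in this interaction."""
        return {u.addressee_id for u in self.utterances
                if u.speaker_id == speaker_id and u.addressee_id is not None}

# Example: a two-turn exchange among three participants.
scene = SocialInteraction(participants=["P1", "P2", "P3"])
scene.utterances.append(Utterance("P1", "Did you finish the report?", "P2"))
scene.utterances.append(Utterance("P2", "Almost, I'll send it tonight.", "P1"))
print(scene.addressees_of("P1"))  # {'P2'}
```

The point of the structure is the gap the paper highlights: producing the `speaker_id` and `addressee_id` fields from raw audio-visual input (rather than receiving them from an oracle preprocessor) is exactly the identity-attribution step where existing pipelines and multimodal LLMs fall short.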
Problem

Research questions and friction points this paper is trying to address.

social interaction understanding
identity attribution
multimodal perception
raw audio-visual input
AI assistants
Innovation

Methods, ideas, or system contributions that make the work stand out.

identity-attributed social interaction
multi-modal LLM
reference-guided pipeline
social reasoning
Omni-MMSI