SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of effective evaluation of multimodal large language models' ability to understand and respond to dynamic social interactions. To this end, we propose SocialOmni, the first multimodal benchmark designed specifically for audio-visual social interaction, which systematically assesses model performance along three dimensions: speaker separation and identification, interruption timing control, and natural interruption generation. The framework introduces interactive generation samples with strict temporal and contextual constraints, together with robustness tests under audio-visual inconsistency, and reveals a critical disconnect between perception accuracy and interactive generation capability. Experiments across twelve state-of-the-art models show that conventional perception-based metrics fail to reflect genuine social-interaction competence, validating the necessity and effectiveness of the proposed benchmark.
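
As a rough, unofficial illustration of how a benchmark with these three dimensions might represent its samples (the paper does not publish a data format here, so every class and field name below is hypothetical), a minimal Python schema could look like:

```python
# Hypothetical sketch of a SocialOmni-style sample schema; the real
# benchmark's field names and file format are not specified in this summary.
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class TaskType(Enum):
    SPEAKER_ID = "speaker_separation_and_identification"  # who is speaking
    TIMING = "interruption_timing_control"                # when to interject
    GENERATION = "natural_interruption_generation"        # how to phrase it

@dataclass
class SocialSample:
    clip_path: str                 # audio-visual dialogue clip
    task: TaskType
    transcript: str                # dialogue context up to the probe point
    probe_time_s: float            # timestamp at which the model is queried
    gold_speaker: Optional[str] = None                 # label for SPEAKER_ID
    interrupt_window_s: Optional[Tuple[float, float]] = None  # acceptable window
    av_inconsistent: bool = False  # True for audio-visual mismatch probes
```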

📝 Abstract
Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity: the fundamental capacity to navigate dynamic cues in natural dialogue. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios that test model robustness. Benchmarking 12 leading OLMs uncovers significant variance in social-interaction capability across models. Furthermore, our analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone cannot characterize conversational social competence. Encouragingly, these diagnostics yield actionable signals for bridging the perception-interaction divide in future OLMs.
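
Building on the hypothetical schema above, here is a minimal sketch of the per-dimension evaluation loop the abstract implies, assuming a `model.respond(sample)` wrapper and a stand-in `score_generation` judge (both invented for illustration, not the official SocialOmni code):

```python
# Hedged sketch of a per-dimension evaluation harness. Uses the hypothetical
# TaskType / SocialSample definitions sketched earlier in this page.
from collections import defaultdict
from statistics import correlation, mean  # correlation requires Python 3.10+

def score_generation(utterance, sample):
    """Stand-in judge: the benchmark's actual scoring rubric is not public."""
    raise NotImplementedError  # plug in an LLM judge or human rating here

def evaluate(model, samples):
    scores = defaultdict(list)
    for s in samples:
        out = model.respond(s)  # assumed API: returns a dict per task type
        if s.task is TaskType.SPEAKER_ID:
            scores["who"].append(float(out["speaker"] == s.gold_speaker))
        elif s.task is TaskType.TIMING:
            lo, hi = s.interrupt_window_s
            scores["when"].append(float(lo <= out["interrupt_at"] <= hi))
        else:  # natural interruption generation, judged on a 0-1 scale
            scores["how"].append(score_generation(out["utterance"], s))
    return {dim: mean(vals) for dim, vals in scores.items()}

def decoupling(per_model_scores):
    # Correlate perception ("who"/"when") with generation ("how") across
    # models: a low value reproduces the paper's finding that accuracy-style
    # metrics do not predict interactive generation quality.
    perception = [(m["who"] + m["when"]) / 2 for m in per_model_scores]
    generation = [m["how"] for m in per_model_scores]
    return correlation(perception, generation)
```
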
Problem

Research questions and friction points this paper is trying to address.

social interactivity
omni-modal large language models
conversational interaction
audio-visual understanding
interruption generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

social interactivity
omni-modal LLMs
interruption generation
speaker identification
audio-visual benchmarking
👥 Authors
Tianyu Xie
Media Analytics and Computing Lab, Xiamen University, Xiamen, China
Jinfa Huang
University of Rochester, Peking University
Vision and Language, Reasoning Models, Generative Models, Computer Vision
Yuexiao Ma
Media Analytics and Computing Lab, Xiamen University, Xiamen, China
Rongfang Luo
Sichuan Agricultural University, Yaan, China
Yan Yang
Sichuan Agricultural University, Yaan, China
Wang Chen
Individual Researcher
Natural Language Processing, Text Generation, Information Extraction
Yuhui Zeng
Media Analytics and Computing Lab, Xiamen University, Xiamen, China
Ruize Fang
Media Analytics and Computing Lab, Xiamen University, Xiamen, China
Yixuan Zou
Media Analytics and Computing Lab, Xiamen University, Xiamen, China
Xiawu Zheng
Associate Professor, IEEE Senior Member, Xiamen University
Automated Machine Learning, Network Compression, Neural Architecture Search, AutoML
Jiebo Luo
Department of Computer Science, University of Rochester, Rochester, NY, USA
Rongrong Ji
Media Analytics and Computing Lab, Xiamen University, Xiamen, China