OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

πŸ“… 2025-03-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing video benchmarks inadequately evaluate multimodal large language models’ (MLLMs) real-time interactive capabilities and active reasoning in streaming video. This paper introduces the first multimodal interactive evaluation benchmark tailored for streaming video, comprising over 1,121 videos and 2,290 questions, explicitly targeting two core challenges: streaming understanding and active reasoning. We propose a novel active reasoning evaluation paradigm and the M4 multimodal reuse modeling framework, which jointly integrates visual, auditory, and linguistic modalities via streaming chunked encoding, cross-modal temporal alignment, incremental response generation, and dynamic attention gating. We systematically evaluate 12 state-of-the-art OmniLLMs across six fine-grained subtasks, uncovering critical bottlenecks in real-world streaming interaction. Empirical results show that M4 achieves a 2.3Γ— speedup and 37% memory reduction over baselines while maintaining 98.5% interactive accuracy.

Technology Category

Application Category

πŸ“ Abstract
The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models, designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see, listen while generating.
Problem

Research questions and friction points this paper is trying to address.

Evaluating real-world interactive capabilities of OmniLLMs in streaming videos
Addressing streaming video understanding and proactive reasoning challenges
Developing an efficient multi-modal model for streaming video contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

OmniMMI benchmark for streaming video evaluation
M4 framework for efficient multi-modal processing
Proactive reasoning in continuous data streams
Y
Yuxuan Wang
Beijing Institute for General Artificial Intelligence, State Key Laboratory of General Artificial Intelligence
Yueqian Wang
Yueqian Wang
Peking University
Multimodal Pre-trained Models
B
Bo Chen
Beijing Institute for General Artificial Intelligence, State Key Laboratory of General Artificial Intelligence
T
Tong Wu
Beijing Institute for General Artificial Intelligence, State Key Laboratory of General Artificial Intelligence
Dongyan Zhao
Dongyan Zhao
Peking University
Natural Language ProcessingSemantic Data ManagementQADialogue System
Z
Zilong Zheng
Beijing Institute for General Artificial Intelligence, State Key Laboratory of General Artificial Intelligence