OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

📅 2025-03-29

📈 Citations: 0

✨ Influential: 0

career value

200K/year

🤖 AI Summary

Existing video benchmarks inadequately evaluate multimodal large language models’ (MLLMs) real-time interactive capabilities and active reasoning in streaming video. This paper introduces the first multimodal interactive evaluation benchmark tailored for streaming video, comprising over 1,121 videos and 2,290 questions, explicitly targeting two core challenges: streaming understanding and active reasoning. We propose a novel active reasoning evaluation paradigm and the M4 multimodal reuse modeling framework, which jointly integrates visual, auditory, and linguistic modalities via streaming chunked encoding, cross-modal temporal alignment, incremental response generation, and dynamic attention gating. We systematically evaluate 12 state-of-the-art OmniLLMs across six fine-grained subtasks, uncovering critical bottlenecks in real-world streaming interaction. Empirical results show that M4 achieves a 2.3× speedup and 37% memory reduction over baselines while maintaining 98.5% interactive accuracy.

Technology Category

Application Category

📝 Abstract

The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models, designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see, listen while generating.

Problem

Research questions and friction points this paper is trying to address.

Evaluating real-world interactive capabilities of OmniLLMs in streaming videos

Addressing streaming video understanding and proactive reasoning challenges

Developing an efficient multi-modal model for streaming video contexts

Innovation

Methods, ideas, or system contributions that make the work stand out.

OmniMMI benchmark for streaming video evaluation

M4 framework for efficient multi-modal processing

Proactive reasoning in continuous data streams

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs