OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current large language models struggle with multimodal interaction over real-time audio-visual streams. In particular, they fail to respond online to user queries and environmental sounds embedded in the audio track. This work introduces the first streaming interaction benchmark for real-time omnimodal large language models. It preserves the raw audio-visual input and requires models, without access to future context, to detect multimodal triggers, decide when to respond, and generate answers as the stream unfolds. Scenarios cover real-time question answering, proactive interaction, and nested tasks, plus a 1QnA setting for continuous task monitoring and step guidance. Three metrics, Interaction-Aware Quality-Timeliness F1 (IA-QTF1), Interruption Diagnostic Suite, and Nested Chain Completion Score, jointly evaluate response correctness, timing, and context continuity. Experiments show that the best overall IA-QTF1 is only 0.368 and the best 1QnA IA-QTF1 is only 0.052, underscoring how hard it is to transfer offline capabilities to online, full-duplex multimodal interaction.

📝 Abstract

We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, requiring models to detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, response window, and target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction-Aware Quality-Timeliness F1, Interruption Diagnostic Suite, and Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction, with the best overall IA-QTF1 reaching only 0.368 and the best 1QnA IA-QTF1 only 0.052. Further study on mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at https://github.com/Lucky-Lance/OmniInteract.
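As a rough illustration of the slot structure described in the abstract (a trigger, a response window, and a target answer), the following Python sketch models one slot and checks whether a response lands inside its window. The class name, field names, time unit (seconds), and the example slot are all assumptions made for illustration; they are not taken from the OmniInteract code or data format, and the check is not the paper's IA-QTF1 metric. Only the slot counts come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseSlot:
    """Hypothetical layout of one temporally grounded response slot."""
    trigger_time: float   # assumed: seconds into the stream when the trigger occurs
    window_start: float   # assumed: earliest acceptable response time, in seconds
    window_end: float     # assumed: latest acceptable response time, in seconds
    target_answer: str    # reference answer for this slot
    kind: str             # "1Q1A" or "1QnA"


def is_timely(slot: ResponseSlot, response_time: float) -> bool:
    """Return True if a response emitted at response_time falls inside the window."""
    return slot.window_start <= response_time <= slot.window_end


# Slot counts as stated in the abstract.
SLOT_COUNTS = {"1Q1A": 1062, "1QnA": 368}
TOTAL_SLOTS = sum(SLOT_COUNTS.values())

# Made-up example slot, for illustration only.
example_slot = ResponseSlot(
    trigger_time=12.0,
    window_start=12.0,
    window_end=17.0,
    target_answer="The kettle is boiling.",
    kind="1Q1A",
)
```

Under these assumptions, a response at 14.5 s would count as timely for `example_slot`, while one at 20.0 s would not. The two slot counts sum to the 1,430 slots reported in the abstract.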

Problem

Research questions and friction points this paper is trying to address.

streaming interaction

real-time omnimodal

audio-visual streams

multimodal triggers

online inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming interaction

omnimodal LLMs

online inference

multimodal triggers

real-time QA
