PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current models excel at offline multimodal understanding, but when processing continuous audio-visual streams on real mobile devices they struggle to decide when to respond, and no dedicated benchmark evaluates this setting. To close this gap, this work proposes PhoStream, the first streaming multimodal assistant benchmark tailored to mobile scenarios, covering four on-screen and off-screen contexts and ten core capabilities. The benchmark comprises 578 videos and 5,572 open-ended question-answer pairs, constructed through automated generation followed by human validation. Evaluation uses an online inference pipeline and an LLM-as-a-Judge automatic scoring mechanism on a 0-100 scale. Experiments reveal that state-of-the-art models such as Gemini 3 Pro score above 80 on Instant and Backward tasks but drop sharply to 16.40 on Forward tasks, exposing a fundamental deficiency in judging when to respond.
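
To make the online-evaluation setting concrete, below is a minimal sketch of a streaming inference loop in which a model must decide, chunk by chunk, whether to answer now or stay silent. The `StreamChunk` layout, the `model.receive_question` / `model.step` interface, and the single-question injection logic are illustrative assumptions, not PhoStream's actual Online Inference Pipeline.

```python
# Hypothetical sketch of an online (streaming) inference loop: the model sees
# audio-visual chunks in arrival order and may answer at any chunk, or not at all.
from dataclasses import dataclass
from typing import Optional

@dataclass
class StreamChunk:
    timestamp: float  # seconds from stream start
    frame: bytes      # encoded video frame
    audio: bytes      # audio segment aligned with the frame

def run_online_inference(model, stream, question: str, question_time: float):
    """Feed chunks in temporal order; record when (and what) the model answers."""
    responses = []
    for chunk in stream:
        # Inject the question once its timestamp is reached (assumed protocol).
        if chunk.timestamp >= question_time:
            model.receive_question(question)
            question_time = float("inf")  # inject only once
        reply: Optional[str] = model.step(chunk.frame, chunk.audio)
        if reply is not None:  # the model chose to speak at this moment
            responses.append((chunk.timestamp, reply))
    return responses
```

Under this framing, a Forward task is failed when the model emits a reply before the chunks carrying the required visual or audio cues have arrived, which is the failure mode the benchmark highlights.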


📝 Abstract
Multimodal Large Language Models excel at offline audio-visual understanding, but their ability to serve as mobile assistants in continuous real-world streams remains underexplored. In daily phone use, mobile assistants must track streaming audio-visual inputs and respond at the right time, yet existing benchmarks are often restricted to multiple-choice questions or short videos. In this paper, we introduce PhoStream, the first mobile-centric streaming benchmark that unifies on-screen and off-screen scenarios to evaluate video, audio, and temporal reasoning. PhoStream contains 5,572 open-ended QA pairs from 578 videos across 4 scenarios and 10 capabilities. We build it with an Automated Generative Pipeline backed by rigorous human verification, and evaluate models using a realistic Online Inference Pipeline with LLM-as-a-Judge evaluation for open-ended responses. Experiments reveal a temporal asymmetry in LLM-judged scores (0-100): models perform well on Instant and Backward tasks (Gemini 3 Pro exceeds 80), but drop sharply on Forward tasks (16.40), largely due to early responses before the required visual and audio cues appear. This highlights a fundamental limitation: current MLLMs struggle to decide when to speak, not just what to say. Code and datasets used in this work will be made publicly accessible at https://github.com/Lucky-Lance/PhoStream.
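
The abstract's LLM-as-a-Judge evaluation for open-ended responses can be sketched as follows. This is only an illustration assuming an OpenAI-compatible chat client; the judge model (`gpt-4o` here), the prompt wording, and the rubric are assumptions and not the scoring setup used by PhoStream.

```python
# Hypothetical LLM-as-a-Judge scorer: grade an open-ended answer on a 0-100 scale
# by prompting a judge model with the question, a reference answer, and the model answer.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are grading a streaming assistant's answer.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {answer}\n"
    "Return only an integer score from 0 to 100 for correctness and timeliness."
)

def judge_score(question: str, reference: str, answer: str,
                judge_model: str = "gpt-4o") -> int:
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, answer=answer),
        }],
        temperature=0,
    )
    # Assumes the judge follows the instruction and returns a bare integer.
    return int(resp.choices[0].message.content.strip())
```

Averaging such per-question scores over a task type (Instant, Backward, Forward) yields the 0-100 figures the paper reports, under the assumptions stated above.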
Problem

Research questions and friction points this paper is trying to address.

streaming
mobile assistants
multimodal large language models
temporal reasoning
real-world scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming benchmark
multimodal LLMs
temporal reasoning
mobile assistants
LLM-as-a-Judge