VSAS-BENCH: Real-Time Evaluation of Visual Streaming Assistant Models

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluations of vision-language models predominantly focus on offline settings and fail to assess capabilities critical to streaming visual assistants, such as response timeliness (proactiveness) and temporal consistency. To address this gap, this work proposes VSAS-Bench, the first comprehensive benchmark tailored to streaming vision-language models, comprising over 18,000 densely timestamped annotations across multiple domains. The benchmark introduces synchronous and asynchronous evaluation protocols that systematically disentangle model performance along key dimensions: accuracy, latency, proactiveness, and consistency. Experiments show that a minimally adapted general-purpose VLM (e.g., Qwen3-VL-4B) outperforms Dispider, the strongest streaming model on the benchmark, by 3% under the asynchronous protocol. The study further finds that input strategy, resolution, and memory mechanisms critically shape the accuracy-latency trade-off.
📝 Abstract
Streaming vision-language models (VLMs) continuously generate responses given an instruction prompt and an online stream of input frames. This is a core mechanism for real-time visual assistants. Existing VLM frameworks predominantly assess models in offline settings. In contrast, the performance of a streaming VLM depends on additional metrics beyond pure video understanding, including proactiveness, which reflects the timeliness of the model's responses, and consistency, which captures the robustness of its responses over time. To address this limitation, we propose VSAS-Bench, a new framework and benchmark for Visual Streaming Assistants. In contrast to prior benchmarks that primarily employ single-turn question answering on video inputs, VSAS-Bench features temporally dense annotations, over 18,000 in total, across diverse input domains and task types. We introduce standardized synchronous and asynchronous evaluation protocols, along with metrics that isolate and measure distinct capabilities of streaming VLMs. Using this framework, we conduct large-scale evaluations of recent video and streaming VLMs, analyzing the accuracy-latency trade-off under key design factors such as memory buffer length, memory access policy, and input resolution, yielding several practical insights. Finally, we show empirically that conventional VLMs can be adapted to streaming settings without additional training, and demonstrate that these adapted models outperform recent streaming VLMs. For example, Qwen3-VL-4B surpasses Dispider, the best streaming VLM on our benchmark, by 3% under the asynchronous protocol. The benchmark and code will be available at https://github.com/apple/ml-vsas-bench.
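The abstract's distinction between accuracy, proactiveness, and latency can be illustrated with a toy scorer for the asynchronous setting: the model emits timestamped responses against ground-truth event windows, and each metric is computed from whether and when a response lands inside its window. This is a minimal sketch under assumed definitions (the `Event` class, the first-hit windowing rule, and the metric formulas are hypothetical, not the benchmark's actual implementation):

```python
from dataclasses import dataclass

@dataclass
class Event:
    start: float   # time (s) at which the answer becomes observable in the stream
    end: float     # deadline after which a response is no longer considered timely
    answer: str    # ground-truth response text

def evaluate_stream(events, responses):
    """Score a streaming model's timestamped responses against event windows.

    responses: list of (time, text) pairs emitted by the model, in stream order.
    Returns accuracy (correct timely answers / all events), proactiveness
    (fraction of events answered before their deadline), and mean latency (s)
    measured from window start, over timely responses only.
    """
    correct, timely, latencies = 0, 0, []
    for ev in events:
        # first response falling inside this event's window, if any
        hit = next((r for r in responses if ev.start <= r[0] <= ev.end), None)
        if hit is None:
            continue  # missed or late: counts against accuracy and proactiveness
        timely += 1
        latencies.append(hit[0] - ev.start)
        if hit[1] == ev.answer:
            correct += 1
    n = len(events)
    return {
        "accuracy": correct / n,
        "proactiveness": timely / n,
        "mean_latency": sum(latencies) / len(latencies) if latencies else None,
    }

events = [Event(0.0, 2.0, "a red car enters"), Event(5.0, 7.0, "the car stops")]
# second response is correct in content but arrives after the deadline
responses = [(0.5, "a red car enters"), (8.0, "the car stops")]
print(evaluate_stream(events, responses))
# → {'accuracy': 0.5, 'proactiveness': 0.5, 'mean_latency': 0.5}
```

The example shows why the paper treats these as separate axes: the second response would be scored correct by an offline evaluation, but under a streaming protocol it misses its window and hurts both accuracy and proactiveness.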
Problem

Research questions and friction points this paper is trying to address.

streaming vision-language models
real-time evaluation
proactiveness
consistency
visual assistants
Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming Vision-Language Models
Real-Time Evaluation
Proactiveness
Consistency
VSAS-Bench