🤖 AI Summary
Existing multimodal large language model (MLLM) benchmarks predominantly rely on static images or single video frames, failing to assess models’ sustained perception, comprehension, and reasoning over dynamic, long-duration video streams.
Method: We introduce RTV-Bench—the first fine-grained, real-time video analysis benchmark for MLLMs. It comprises 552 long-duration real-world videos (167.2 hours) and 4,631 high-quality multi-timestamp question-answer (MTQA) pairs, underpinned by a hierarchical question design and a multidimensional capability evaluation framework.
Contribution/Results: Experiments reveal that open-source real-time models significantly outperform offline counterparts but still lag behind top-tier closed-source models. Crucially, neither increased parameter count nor higher frame sampling rates consistently improve performance—highlighting the critical role of architectural optimization. RTV-Bench establishes a standardized, reproducible evaluation paradigm for real-time multimodal reasoning, enabling systematic assessment of temporal understanding and streaming inference capabilities.
📝 Abstract
Multimodal Large Language Models (MLLMs) increasingly excel at perception, understanding, and reasoning. However, current benchmarks inadequately evaluate their ability to perform these tasks continuously in dynamic, real-world environments. To bridge this gap, we introduce RTV-Bench, a fine-grained benchmark for MLLM real-time video analysis. RTV-Bench rests on three key principles: (1) Multi-Timestamp Question Answering (MTQA), where answers evolve with scene changes; (2) Hierarchical Question Structure, combining basic and advanced queries; and (3) Multi-dimensional Evaluation, assessing continuous perception, understanding, and reasoning abilities. RTV-Bench contains 552 diverse videos (167.2 hours) and 4,631 high-quality QA pairs. We evaluated leading MLLMs, including proprietary (GPT-4o, Gemini 2.0), open-source offline (Qwen2.5-VL, VideoLLaMA3), and open-source real-time (VITA-1.5, InternLM-XComposer2.5-OmniLive) models. Experimental results show that open-source real-time models largely outperform offline ones but still trail top proprietary models. Our analysis also reveals that larger model sizes and higher frame sampling rates do not significantly boost RTV-Bench performance, and sometimes cause slight decreases. This underscores the need for model architectures better optimized for video stream processing and long sequences to advance real-time video analysis with MLLMs. Our benchmark toolkit is available at: https://github.com/LJungang/RTV-Bench.
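The MTQA principle described above can be sketched in code: the same question is posed at several timestamps in a video stream, and the correct answer may change as the scene evolves. The schema, field names, and scoring function below are illustrative assumptions for exposition, not the actual RTV-Bench data format or evaluation toolkit.

```python
from dataclasses import dataclass

@dataclass
class MTQAItem:
    # Hypothetical MTQA record: one (question, timestamp) pair with a
    # ground-truth answer valid at that point in the stream.
    question: str
    timestamp_s: float   # when in the video the question is asked
    options: list[str]   # multiple-choice candidates
    answer: str          # correct option *at this timestamp*

def mtqa_accuracy(items: list[MTQAItem], predictions: list[str]) -> float:
    """Fraction of (question, timestamp) pairs answered correctly."""
    correct = sum(pred == item.answer for item, pred in zip(items, predictions))
    return correct / len(items)

# The same question asked twice; the answer evolves with the scene.
items = [
    MTQAItem("How many players are on the court?", 30.0, ["4", "5", "6"], "5"),
    MTQAItem("How many players are on the court?", 95.0, ["4", "5", "6"], "4"),
]
print(mtqa_accuracy(items, ["5", "6"]))  # one of two timestamps correct -> 0.5
```

The key point the sketch captures is that each timestamp is scored independently, so a model that answers correctly once but fails to track the scene change is penalized, which is what distinguishes MTQA from single-frame QA.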