🤖 AI Summary
Existing multimodal large language model (MLLM) benchmarks predominantly rely on static images or single video frames, failing to assess models’ sustained perception, comprehension, and reasoning over dynamic, long-duration video streams.
Method: We introduce RTV-Bench—the first fine-grained, real-time video analysis benchmark for MLLMs. It comprises 552 long-duration real-world videos (167.2 hours) and 4,631 high-quality multi-timestamp question-answer (MTQA) pairs, underpinned by a hierarchical question design and a multidimensional capability evaluation framework.
Contribution/Results: Experiments reveal that open-source real-time models significantly outperform offline counterparts but still lag behind top-tier closed-source models. Crucially, neither increased parameter count nor higher frame sampling rates consistently improve performance—highlighting the critical role of architectural optimization. RTV-Bench establishes a standardized, reproducible evaluation paradigm for real-time multimodal reasoning, enabling systematic assessment of temporal understanding and streaming inference capabilities.
📝 Abstract
Multimodal Large Language Models (MLLMs) increasingly excel at perception, understanding, and reasoning. However, current benchmarks inadequately evaluate their ability to perform these tasks continuously in dynamic, real-world environments. To bridge this gap, we introduce RTV-Bench, a fine-grained benchmark for MLLM real-time video analysis. RTV-Bench rests on three key principles: (1) Multi-Timestamp Question Answering (MTQA), where answers evolve with scene changes; (2) Hierarchical Question Structure, combining basic and advanced queries; and (3) Multi-dimensional Evaluation, assessing continuous perception, understanding, and reasoning abilities. RTV-Bench contains 552 diverse videos (167.2 hours) and 4,631 high-quality QA pairs. We evaluated leading MLLMs, including proprietary (GPT-4o, Gemini 2.0), open-source offline (Qwen2.5-VL, VideoLLaMA3), and open-source real-time (VITA-1.5, InternLM-XComposer2.5-OmniLive) models. Experimental results show that open-source real-time models largely outperform offline ones but still trail top proprietary models. Our analysis also reveals that larger model sizes and higher frame sampling rates do not significantly boost RTV-Bench performance, and sometimes cause slight decreases. This underscores the need for model architectures better optimized for video stream processing and long sequences to advance real-time video analysis with MLLMs. Our benchmark toolkit is available at: https://github.com/LJungang/RTV-Bench.
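The MTQA principle described above can be sketched in code: the same question is posed at several timestamps in a video stream, and the correct answer may change as the scene evolves. The schema, field names, and scoring function below are illustrative assumptions for exposition, not the actual RTV-Bench data format or evaluation toolkit.

```python
from dataclasses import dataclass

@dataclass
class MTQAItem:
    # Hypothetical MTQA record: one (question, timestamp) pair with a
    # ground-truth answer valid at that point in the stream.
    question: str
    timestamp_s: float   # when in the video the question is asked
    options: list[str]   # multiple-choice candidates
    answer: str          # correct option *at this timestamp*

def mtqa_accuracy(items: list[MTQAItem], predictions: list[str]) -> float:
    """Fraction of (question, timestamp) pairs answered correctly."""
    correct = sum(pred == item.answer for item, pred in zip(items, predictions))
    return correct / len(items)

# The same question asked twice; the answer evolves with the scene.
items = [
    MTQAItem("How many players are on the court?", 30.0, ["4", "5", "6"], "5"),
    MTQAItem("How many players are on the court?", 95.0, ["4", "5", "6"], "4"),
]
print(mtqa_accuracy(items, ["5", "6"]))  # one of two timestamps correct -> 0.5
```

The key point the sketch captures is that each timestamp is scored independently, so a model that answers correctly once but fails to track the scene change is penalized, which is what distinguishes MTQA from single-frame QA.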