IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

📅 2025-04-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language model (MLLM) evaluation frameworks largely overlook the critical role of image context in video understanding. Method: We introduce IV-Bench, the first comprehensive benchmark for image-grounded video perception and reasoning, comprising 967 videos, 2,585 human-annotated image-text queries, and 13 fine-grained tasks. We formally define and systematically evaluate image-grounded video understanding, establishing a multi-task, multi-category annotation framework to analyze key factors including frame count, spatial resolution, and inference pattern. Contribution/Results: Extensive evaluation of state-of-the-art MLLMs, including InternVL2.5, Qwen2.5-VL, GPT-4o, and Gemini 2 (Flash and Pro), reveals severe performance limitations (maximum accuracy: 28.9%). A data-synthesis analysis further shows that the benchmark's difficulty is not resolved by merely aligning training data formats. We publicly release the benchmark code and data to standardize evaluation and advance model development in image-grounded video reasoning.
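
Since this page does not spell out IV-Bench's data schema or evaluation harness, the following is a minimal sketch of what an image-grounded video QA evaluation loop might look like. The JSONL field names (video_path, image_path, question, options, answer) and the model.answer interface are assumptions for illustration, not the benchmark's actual API.

```python
import json
import cv2

def sample_frames(video_path: str, num_frames: int) -> list:
    """Uniformly sample num_frames frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

def evaluate(model, queries_path: str, num_frames: int = 32) -> float:
    """Score a model on image-grounded video QA (hypothetical schema)."""
    with open(queries_path) as f:
        queries = [json.loads(line) for line in f]
    correct = 0
    for q in queries:
        frames = sample_frames(q["video_path"], num_frames)
        # The query image is supplied alongside the video frames, so the
        # model must ground its answer in both modalities jointly.
        pred = model.answer(frames=frames, image=q["image_path"],
                            question=q["question"], options=q["options"])
        correct += int(pred == q["answer"])
    return correct / len(queries)
```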

📝 Abstract
Existing evaluation frameworks for Multimodal Large Language Models (MLLMs) primarily focus on image reasoning or general video understanding tasks, largely overlooking the significant role of image context in video comprehension. To bridge this gap, we propose IV-Bench, the first comprehensive benchmark for evaluating Image-Grounded Video Perception and Reasoning. IV-Bench consists of 967 videos paired with 2,585 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning tasks) and 5 representative categories. Extensive evaluations of state-of-the-art open-source (e.g., InternVL2.5, Qwen2.5-VL) and closed-source (e.g., GPT-4o, Gemini2-Flash, and Gemini2-Pro) MLLMs demonstrate that current models substantially underperform in image-grounded video perception and reasoning, achieving at most 28.9% accuracy. Further analysis reveals key factors influencing model performance on IV-Bench, including inference pattern, frame number, and resolution. Additionally, through a simple data synthesis approach, we demonstrate that the challenges of IV-Bench extend beyond merely aligning the data format in the training process. These findings collectively provide valuable insights for future research. Our code and data are released at https://github.com/multimodal-art-projection/IV-Bench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating image-grounded video perception and reasoning in MLLMs
Addressing the lack of benchmarks that test image context in video comprehension
Assessing performance gaps of current MLLMs on image-grounded video tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces IV-Bench for image-grounded video evaluation
Includes 967 videos with 2,585 annotated queries
Analyzes key performance factors such as frame count, resolution, and inference pattern (see the sketch below)
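
A hedged sketch of how such a factor analysis might be run, reusing the evaluate helper from the sketch above. The grid values and the set_input_resolution hook are illustrative assumptions, not details from the paper.

```python
def ablate(model, queries_path: str) -> dict:
    """Sweep frame count and input resolution, recording accuracy per setting."""
    results = {}
    for num_frames in (8, 16, 32, 64):
        for resolution in (224, 448):
            # Assumed hook for changing the model's visual input size;
            # real MLLMs expose this differently (or not at all).
            model.set_input_resolution(resolution)
            acc = evaluate(model, queries_path, num_frames=num_frames)
            results[(num_frames, resolution)] = acc
            print(f"frames={num_frames:3d} res={resolution}: acc={acc:.3f}")
    return results
```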