MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

📅 2025-05-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-context vision-language models (LCVLMs) lack a systematic evaluation benchmark, which hinders rigorous assessment of their long-context capabilities. Method: We introduce MMLongBench, the first comprehensive benchmark for LCVLMs, comprising 13,331 samples across five task categories and supporting standardized evaluation at multimodal context lengths from 8K to 128K tokens. It features a cross-modal tokenization scheme (combining vision patches and text tokens), a multi-length input design, and broad coverage of task and image-type combinations. Contribution/Results: MMLongBench is the first benchmark to evaluate LCVLMs' long-context proficiency thoroughly, showing that performance on a single task is a weak proxy for overall long-context capability. Evaluating 46 state-of-the-art closed-source and open-source models reveals pervasive long-context performance bottlenecks; notably, models with stronger reasoning ability exhibit greater robustness at extended context lengths.

📝 Abstract
The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench is composed of 13,331 examples spanning five different categories of downstream tasks, such as Visual RAG and Many-Shot ICL. It also provides broad coverage of image types, including various natural and synthetic images. To assess the robustness of the models to different input lengths, all examples are delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme that combines vision patches and text tokens. Through a thorough benchmarking of 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of the current models' vision-language long-context ability. Our results show that: i) performance on a single task is a weak proxy for overall long-context capability; ii) both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement; iii) models with stronger reasoning ability tend to exhibit better long-context performance. By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.
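The abstract's length control counts vision patches and text tokens under a single budget so every example can be delivered at one of five standardized lengths (8K to 128K tokens). The paper's own implementation is not reproduced here; the snippet below is only a minimal sketch of how such a combined count could be computed. The tokenizer choice (tiktoken's cl100k_base), the 448-pixel resize, the 14-pixel patch size, and the function names are all illustrative assumptions, not the authors' method.

```python
# Illustrative sketch (not the authors' code): combined multimodal token
# counting for length-standardized examples.
import tiktoken  # assumed text tokenizer; any BPE tokenizer would do

PATCH_SIZE = 14          # assumed ViT-style patch size
IMAGE_SIDE = 448         # assumed resized image resolution
PATCHES_PER_IMAGE = (IMAGE_SIDE // PATCH_SIZE) ** 2  # 32 * 32 = 1024 patch tokens

_ENC = tiktoken.get_encoding("cl100k_base")  # assumed encoding

def multimodal_length(text: str, num_images: int) -> int:
    """Combined token count of interleaved text and images."""
    return len(_ENC.encode(text)) + num_images * PATCHES_PER_IMAGE

def fits_budget(text: str, num_images: int, budget: int = 128_000) -> bool:
    """Check whether an example fits a standardized context length."""
    return multimodal_length(text, num_images) <= budget
```

Under these assumptions, an example with 20K text tokens and 40 images would count as roughly 61K tokens and, if the five standardized lengths are the power-of-two steps 8K, 16K, 32K, 64K, and 128K, would fall into the 64K bucket.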
Problem

Research questions and friction points this paper is trying to address.

Evaluating long-context vision-language models (LCVLMs) effectively
Assessing LCVLMs' robustness across diverse tasks and image types
Diagnosing performance gaps in current LCVLMs for future improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

First benchmark for long-context vision-language tasks
Cross-modal tokenization scheme that combines vision patches and text tokens
Five standardized input lengths (8K-128K tokens) for length-controlled assessment