🤖 AI Summary
Existing evaluations of long-context faithfulness in large vision-language models (LVLMs) focus exclusively on text, while multimodal benchmarks remain constrained to short contexts. Method: We introduce MMLongCite, the first benchmark for long-context multimodal faithfulness assessment, covering text, image, and video modalities across eight tasks and six context-length intervals (4K–128K tokens). It features a context-sensitive task paradigm, a hierarchical sampling strategy, and a position-sensitivity analysis framework to systematically characterize how context length and the placement of critical information affect model performance. Contribution/Results: Experiments reveal substantial faithfulness degradation in state-of-the-art LVLMs under long multimodal contexts, confirming the benchmark's diagnostic rigor and difficulty. MMLongCite fills a critical gap in multimodal evaluation, providing a reproducible, fine-grained standard to guide architectural and training improvements for long-context LVLMs.
📝 Abstract
The rapid advancement of large vision-language models (LVLMs) has led to a significant expansion of their context windows. However, an extended context window does not guarantee effective utilization of the context, posing a critical challenge for real-world applications. Current evaluations of such long-context faithfulness are predominantly focused on the text-only domain, while multimodal assessments remain limited to short contexts. To bridge this gap, we introduce MMLongCite, a comprehensive benchmark designed to evaluate the fidelity of LVLMs in long-context scenarios. MMLongCite comprises 8 distinct tasks spanning 6 context length intervals and incorporates diverse modalities, including text, images, and videos. Our evaluation of state-of-the-art LVLMs reveals their limited faithfulness in handling long multimodal contexts. Furthermore, we provide an in-depth analysis of how context length and the position of crucial content affect the faithfulness of these models.
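To make the length-and-position analysis concrete, below is a minimal, hypothetical sketch of the kind of sweep the abstract describes: the key evidence is inserted at a chosen relative depth inside a context padded to a target token budget, and faithfulness is scored per (context length, evidence position) cell. The helper callables (`assemble_context`, `run_lvlm`, `score_answer`) and the specific length/position grids are assumptions for illustration, not MMLongCite's actual API.

```python
# Hypothetical evaluation sweep over context length and evidence position.
# Callers supply the benchmark-specific pieces: how to assemble the long
# multimodal context, how to run the LVLM, and how to judge faithfulness.
from itertools import product
from typing import Callable, Sequence

CONTEXT_LENGTHS = (4_000, 8_000, 16_000, 32_000, 64_000, 128_000)  # tokens (assumed grid)
EVIDENCE_POSITIONS = (0.0, 0.25, 0.5, 0.75, 1.0)  # relative depth of the key evidence


def evaluate_faithfulness(
    samples: Sequence[dict],
    assemble_context: Callable[[dict, int, float], object],
    run_lvlm: Callable[[object, str], str],
    score_answer: Callable[[str, dict], bool],
) -> dict:
    """Return accuracy per (context length, evidence position) cell.

    For each cell, the evidence needed to answer is placed at the given
    relative depth inside a context padded with distractor text/images/frames
    up to the target token budget; the model's answer is then scored.
    """
    results = {}
    for length, position in product(CONTEXT_LENGTHS, EVIDENCE_POSITIONS):
        correct = 0
        for sample in samples:
            context = assemble_context(sample, length, position)
            answer = run_lvlm(context, sample["question"])
            correct += score_answer(answer, sample)
        results[(length, position)] = correct / len(samples)
    return results
```

Aggregating scores this way yields a length-by-position grid, which is one straightforward way to visualize both overall degradation as contexts grow and any sensitivity to where the crucial content sits.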