MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models

📅 2025-10-15
🤖 AI Summary
Existing evaluations of long-context fidelity in Large Vision-Language Models (LVLMs) focus exclusively on text, while multimodal settings remain constrained to short contexts. Method: We introduce MMLongCite—the first benchmark for long-context multimodal fidelity assessment—covering text, image, and video modalities across eight tasks and six context-length intervals (4K–128K tokens). It features a context-sensitive task paradigm, a hierarchical sampling strategy, and a position-sensitivity analysis framework to systematically characterize how context length and the placement of critical information affect model performance. Contribution/Results: Experiments reveal substantial fidelity degradation in state-of-the-art LVLMs under long multimodal contexts, confirming MMLongCite's diagnostic rigor and challenge level. The benchmark fills a critical gap in multimodal evaluation, providing a reproducible, fine-grained standard to guide architectural and training improvements for long-context LVLMs.

📝 Abstract
The rapid advancement of large vision-language models (LVLMs) has led to a significant expansion of their context windows. However, an extended context window does not guarantee the effective utilization of the context, posing a critical challenge for real-world applications. Current evaluations of such long-context faithfulness are predominantly focused on the text-only domain, while multimodal assessments remain limited to short contexts. To bridge this gap, we introduce MMLongCite, a comprehensive benchmark designed to evaluate the fidelity of LVLMs in long-context scenarios. MMLongCite comprises 8 distinct tasks spanning 6 context length intervals and incorporates diverse modalities, including text, images, and videos. Our evaluation of state-of-the-art LVLMs reveals their limited faithfulness in handling long multimodal contexts. Furthermore, we provide an in-depth analysis of how context length and the position of crucial content affect the faithfulness of these models.
Problem

Research questions and friction points this paper is trying to address.

Evaluating fidelity of long-context vision-language models
Assessing multimodal faithfulness across extended context windows
Analyzing context length impact on model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MMLongCite benchmark for long-context evaluation
Spans 8 tasks across 6 context length intervals
Incorporates text, image, and video modalities
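The benchmark's stratification of samples into six context-length intervals spanning 4K–128K tokens can be sketched as a simple bucketing step. This is a minimal illustration, not the paper's code; the exact interval boundaries below are an assumption, since the summary only states that six intervals cover the 4K–128K range.

```python
# Hypothetical sketch: assign a sample's context length (in tokens) to one
# of six intervals covering 4K-128K. The boundary values are assumed for
# illustration; the paper excerpt does not specify them.
BOUNDS = [4_000, 8_000, 16_000, 32_000, 64_000, 96_000, 128_000]

def length_interval(num_tokens: int) -> str:
    """Return a label like '8K-16K' for the interval containing num_tokens."""
    for lo, hi in zip(BOUNDS, BOUNDS[1:]):
        if lo <= num_tokens < hi:
            return f"{lo // 1000}K-{hi // 1000}K"
    raise ValueError("token count outside the 4K-128K range")
```

Binning samples this way lets per-interval accuracy be reported separately, which is how a benchmark can expose fidelity degradation as context length grows.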