🤖 AI Summary
Existing evaluations of long-context faithfulness in large vision-language models (LVLMs) focus exclusively on text, while multimodal benchmarks remain constrained to short contexts. Method: We introduce MMLongCite, the first benchmark for long-context multimodal faithfulness assessment, covering text, image, and video modalities across eight tasks and six context-length intervals (4K–128K tokens). It features a context-sensitive task paradigm, a hierarchical sampling strategy, and a position-sensitivity analysis framework to systematically characterize how context length and the placement of critical information affect model performance. Contribution/Results: Experiments reveal substantial faithfulness degradation in state-of-the-art LVLMs under long multimodal contexts, confirming the benchmark's diagnostic rigor and difficulty. MMLongCite fills a critical gap in multimodal evaluation, providing a reproducible, fine-grained standard to guide architectural and training improvements for long-context LVLMs.
📝 Abstract
The rapid advancement of large vision-language models (LVLMs) has led to a significant expansion of their context windows. However, an extended context window does not guarantee effective utilization of the context, posing a critical challenge for real-world applications. Current evaluations of such long-context faithfulness are predominantly focused on the text-only domain, while multimodal assessments remain limited to short contexts. To bridge this gap, we introduce MMLongCite, a comprehensive benchmark designed to evaluate the fidelity of LVLMs in long-context scenarios. MMLongCite comprises 8 distinct tasks spanning 6 context length intervals and incorporates diverse modalities, including text, images, and videos. Our evaluation of state-of-the-art LVLMs reveals their limited faithfulness in handling long multimodal contexts. Furthermore, we provide an in-depth analysis of how context length and the position of crucial content affect the faithfulness of these models.
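To make the length-and-position analysis concrete, below is a minimal, hypothetical sketch of the kind of sweep the abstract describes: the key evidence is inserted at a chosen relative depth inside a context padded to a target token budget, and faithfulness is scored per (context length, evidence position) cell. The helper callables (`assemble_context`, `run_lvlm`, `score_answer`) and the specific length/position grids are assumptions for illustration, not MMLongCite's actual API.

```python
# Hypothetical evaluation sweep over context length and evidence position.
# Callers supply the benchmark-specific pieces: how to assemble the long
# multimodal context, how to run the LVLM, and how to judge faithfulness.
from itertools import product
from typing import Callable, Sequence

CONTEXT_LENGTHS = (4_000, 8_000, 16_000, 32_000, 64_000, 128_000)  # tokens (assumed grid)
EVIDENCE_POSITIONS = (0.0, 0.25, 0.5, 0.75, 1.0)  # relative depth of the key evidence


def evaluate_faithfulness(
    samples: Sequence[dict],
    assemble_context: Callable[[dict, int, float], object],
    run_lvlm: Callable[[object, str], str],
    score_answer: Callable[[str, dict], bool],
) -> dict:
    """Return accuracy per (context length, evidence position) cell.

    For each cell, the evidence needed to answer is placed at the given
    relative depth inside a context padded with distractor text/images/frames
    up to the target token budget; the model's answer is then scored.
    """
    results = {}
    for length, position in product(CONTEXT_LENGTHS, EVIDENCE_POSITIONS):
        correct = 0
        for sample in samples:
            context = assemble_context(sample, length, position)
            answer = run_lvlm(context, sample["question"])
            correct += score_answer(answer, sample)
        results[(length, position)] = correct / len(samples)
    return results
```

Aggregating scores this way yields a length-by-position grid, which is one straightforward way to visualize both overall degradation as contexts grow and any sensitivity to where the crucial content sits.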