COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

📅 2026-04-29
🤖 AI Summary
Current evaluation frameworks lack a systematic benchmark for assessing the fine-grained image-text alignment and comprehension capabilities of multimodal large language models (MLLMs) in interleaved visual-textual contexts. To address this gap, this work proposes COHERENCE, the first evaluation framework specifically designed for fine-grained image-text alignment in interleaved multimodal settings. The authors construct a high-quality benchmark dataset comprising 6,161 questions across four domains and introduce an MLLM-based evaluation methodology coupled with a structured error attribution mechanism. Through six categories of fine-grained error analysis, the study precisely identifies critical shortcomings of existing models in cross-modal contextual reasoning, thereby offering clear directions for future improvements.
📝 Abstract
In recent years, Multimodal Large Language Models (MLLMs) have achieved remarkable progress on a wide range of multimodal benchmarks. Despite these advances, most existing benchmarks mainly focus on single-image or multi-image comprehension. In real-world scenarios such as document reading, information is often presented as interleaved multimodal contexts. This requires MLLMs not only to recognize the content of individual images, but also to identify relevant textual and visual evidence, establish fine-grained alignments between them, and reason over these aligned signals within the interleaved context. However, there is still a lack of systematic benchmarks for quantifying the fine-grained understanding ability of MLLMs in interleaved image-text contexts. To fill this gap, we propose COHERENCE, a benchmark designed to evaluate the ability of MLLMs to recover fine-grained image-text correspondences in interleaved multimodal contexts. COHERENCE covers interleaved image-text content from four representative domains and contains 6,161 high-quality questions. Moreover, we perform a six-type error analysis, enabling fine-grained attribution of failures in interleaved image-text understanding to the specific capabilities missing in current MLLMs.
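As a rough illustration of what per-domain scoring and error attribution could look like in practice, the sketch below computes accuracy per domain and counts error labels over a handful of made-up records. The domain names, error labels, record layout, and the `summarize` helper are all hypothetical placeholders; the abstract does not list the four domains or the six error types, so nothing here reflects the benchmark's actual schema or results.

```python
from collections import Counter, defaultdict

# Hypothetical records: domain names and error labels are placeholders,
# not the categories used by COHERENCE.
results = [
    {"domain": "domain_a", "correct": True, "error_type": None},
    {"domain": "domain_a", "correct": False, "error_type": "error_type_1"},
    {"domain": "domain_b", "correct": False, "error_type": "error_type_2"},
    {"domain": "domain_b", "correct": True, "error_type": None},
    {"domain": "domain_b", "correct": False, "error_type": "error_type_1"},
]


def summarize(records):
    """Return (accuracy per domain, counts of error labels on wrong answers)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    errors = Counter()
    for record in records:
        totals[record["domain"]] += 1
        if record["correct"]:
            hits[record["domain"]] += 1
        else:
            # Only incorrect answers carry an error label.
            errors[record["error_type"]] += 1
    accuracy = {domain: hits[domain] / totals[domain] for domain in totals}
    return accuracy, errors


accuracy, errors = summarize(results)
print(accuracy)
print(errors.most_common())
```

Keeping the accuracy table and the error histogram separate mirrors the split the abstract describes: one number says how often a model fails in each domain, the other says which missing capability the failures are attributed to.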
Problem

Research questions and friction points this paper is trying to address.

fine-grained alignment
interleaved multimodal contexts
image-text correspondence
multimodal benchmark
MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained alignment
interleaved multimodal contexts
multimodal benchmark
image-text correspondence
error analysis