CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation

📅 2024-06-15

🏛️ arXiv.org

📈 Citations: 5

✨ Influential: 2

career value

182K/year

🤖 AI Summary

Contemporary multimodal large language models (MLLMs) suffer from narrative discontinuity, entity drift, and style mismatch in interleaved image-text generation—primarily due to insufficient cross-modal coherence in training data. To address this, we propose CoMM, the first high-quality, high-fidelity dataset explicitly designed for interleaved image-text generation. CoMM introduces a novel multi-perspective joint filtering mechanism—assessing semantic consistency, image insertion plausibility, and image-text alignment—and integrates instructional and visual-narrative data. We further design four new benchmark tasks and establish the first dedicated evaluation framework for this setting. Experiments demonstrate that CoMM substantially enhances MLLMs’ few-shot in-context learning capabilities, yielding an average +4.2% performance gain across downstream tasks including interleaved generation and cross-modal reasoning. Rigorous evaluation confirms CoMM’s high data quality and strong generalization capacity.

Technology Category

Application Category

📝 Abstract

Interleaved image-text generation has emerged as a crucial multimodal task, aiming at creating sequences of interleaved visual and textual content given a query. Despite notable advancements in recent multimodal large language models (MLLMs), generating integrated image-text sequences that exhibit narrative coherence and entity and style consistency remains challenging due to poor training data quality. To address this gap, we introduce CoMM, a high-quality Coherent interleaved image-text MultiModal dataset designed to enhance the coherence, consistency, and alignment of generated multimodal content. Initially, CoMM harnesses raw data from diverse sources, focusing on instructional content and visual storytelling, establishing a foundation for coherent and consistent content. To further refine the data quality, we devise a multi-perspective filter strategy that leverages advanced pre-trained models to ensure the development of sentences, consistency of inserted images, and semantic alignment between them. Various quality evaluation metrics are designed to prove the high quality of the filtered dataset. Meanwhile, extensive few-shot experiments on various downstream tasks demonstrate CoMM's effectiveness in significantly enhancing the in-context learning capabilities of MLLMs. Moreover, we propose four new tasks to evaluate MLLMs' interleaved generation abilities, supported by a comprehensive evaluation framework. We believe CoMM opens a new avenue for advanced MLLMs with superior multimodal in-context learning and understanding ability.

Problem

Research questions and friction points this paper is trying to address.

Generating coherent interleaved image-text sequences with narrative consistency

Addressing poor training data quality for multimodal content generation

Enhancing multimodal models' in-context learning and generation abilities

Innovation

Methods, ideas, or system contributions that make the work stand out.

High-quality multimodal dataset CoMM

Multi-perspective filter strategy

Four new interleaved generation tasks

🔎 Similar Papers

No similar papers found.