CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models exhibit limited performance on cross-modal multi-hop reasoning tasks, primarily due to the absence of training and evaluation data that explicitly enforce joint integration of visual and textual information across multiple reasoning steps. To address this gap, this work proposes a graph-structured automated synthesis framework to construct CRIT, a multimodal dataset comprising images, videos, and text. By leveraging multimodal alignment and generating explicit reasoning trajectories, the framework compels models to collaboratively utilize cross-modal cues throughout the multi-hop inference process. This approach not only fills a critical void in existing benchmarks for complex cross-modal reasoning but also substantially enhances model performance on CRIT as well as established multimodal reasoning benchmarks such as SPIQA.
📝 Abstract
Real-world reasoning often requires combining information across modalities, connecting textual context with visual cues in a multi-hop process. Yet most multimodal benchmarks fail to capture this ability: they typically rely on a single image or a small set of images, where answers can be inferred from one modality alone. This limitation is mirrored in the training data, where interleaved image-text content rarely enforces complementary, multi-hop reasoning. As a result, Vision-Language Models (VLMs) frequently hallucinate and produce reasoning traces poorly grounded in visual evidence. To address this gap, we introduce CRIT, a new dataset and benchmark built with a graph-based automatic pipeline for generating complex cross-modal reasoning tasks. CRIT spans diverse domains, including natural images, videos, and text-rich sources, and includes a manually verified test set for reliable evaluation. Experiments on this benchmark reveal that even state-of-the-art models struggle on such reasoning tasks. Models trained on CRIT show significant gains in cross-modal multi-hop reasoning, including strong improvements on SPIQA and other standard multimodal benchmarks.
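To make the "graph-based automatic pipeline" idea concrete, here is a minimal, hypothetical sketch of how such a generator might work: walk a small entity graph whose edges record which modality supplies the supporting evidence, then compose the hops into one nested question so that no single modality suffices. The entities, relations, templates, and greedy walk below are invented for illustration; they are not the paper's actual pipeline.

```python
# Toy knowledge graph: nodes are entities; each edge stores a relation, a
# target entity, and the modality of the evidence that grounds the hop.
# All example data here is hypothetical.
EDGES = {
    "Eiffel Tower": [("located_in", "Paris", "image"),
                     ("designed_by", "Gustave Eiffel", "text")],
    "Paris": [("capital_of", "France", "text")],
    "France": [("currency", "Euro", "text")],
}

# One surface template per relation, used to verbalize a hop.
TEMPLATES = {
    "located_in": "Where is {} located?",
    "designed_by": "Who designed {}?",
    "capital_of": "What is {} the capital of?",
    "currency": "What is the currency of {}?",
}

def walk(start, max_hops):
    """Greedily walk up to max_hops edges, preferring edges whose target
    can itself be extended, so paths reach the requested depth."""
    node, path = start, []
    for _ in range(max_hops):
        options = EDGES.get(node, [])
        if not options:
            break
        edge = next((e for e in options if e[1] in EDGES), options[0])
        path.append((node,) + edge)  # (source, relation, target, modality)
        node = edge[1]
    return path

def compose_question(path):
    """Nest each hop's sub-question inside the next, so answering requires
    resolving every intermediate entity across the edges' modalities."""
    question = TEMPLATES[path[0][1]].format(path[0][0])
    for _, rel, _, _ in path[1:]:
        question = TEMPLATES[rel].format(f"the answer to '{question}'")
    return question

path = walk("Eiffel Tower", 3)
question = compose_question(path)
answer = path[-1][2]                  # final entity on the path
modalities = [hop[3] for hop in path] # evidence modality per hop
```

A real pipeline would additionally attach the actual images/videos and text snippets as evidence for each hop and filter paths whose answer leaks through one modality alone; this sketch only shows the path-sampling and question-composition core.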
Problem

Research questions and friction points this paper is trying to address.

cross-modal reasoning
multi-hop reasoning
vision-language models
multimodal benchmarks
data synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

graph-based data synthesis
cross-modal reasoning
multi-hop reasoning
vision-language models
automatic dataset generation