Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multi-modal learning benchmarks inadequately characterize the interplay between intra-modal dependencies (single-modality contributions to a task) and inter-modal dependencies (cross-modal interactions), leading to ambiguous evaluation of true multi-modal reasoning. Method: We propose the “Multi-modal Data Spectrum” framework and conduct a large-scale quantitative analysis across 23 visual question answering (VQA) benchmarks, leveraging attribution analysis and systematic ablation studies with multi-modal large language models (MLLMs). Contribution/Results: We uncover a pervasive visual-textual dependency imbalance: several datasets designed to mitigate language bias inadvertently reinforce image-only shortcuts. Critically, MLLMs frequently exploit these shortcuts to mask deficits in genuine multi-modal reasoning. Our findings yield interpretable, quantifiable principles for benchmark design, emphasizing balanced modality reliance, and advance the development of models capable of authentic joint reasoning.

📝 Abstract
Understanding the interplay between intra-modality dependencies (the contribution of an individual modality to a target task) and inter-modality dependencies (the relationships between modalities and the target task) is fundamental to advancing multi-modal learning. However, the nature of and interaction between these dependencies within current benchmark evaluations remains poorly characterized. In this work, we present a large-scale empirical study to quantify these dependencies across 23 visual question-answering benchmarks using multi-modal large language models (MLLMs) covering domains such as general and expert knowledge reasoning, optical character recognition, and document understanding. Our findings show that the reliance on vision, question (text), and their interaction varies significantly, both across and within benchmarks. We discover that numerous benchmarks intended to mitigate text-only biases have inadvertently amplified image-only dependencies. This characterization persists across model sizes, as larger models often use these intra-modality dependencies to achieve high performance that mask an underlying lack of multi-modal reasoning. We provide a quantitative characterization of multi-modal datasets, enabling a principled approach to multi-modal benchmark design and evaluation.
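To make the study design concrete, here is a minimal sketch of the kind of modality-ablation protocol the abstract describes, assuming a generic `model(image, question)` callable, a simple exact-match scorer, and a dataset of (image, question, answer) records; these names are illustrative assumptions, not the paper's actual evaluation harness.

```python
# Rough sketch of a modality-ablation protocol for one VQA benchmark.
# `model` is a hypothetical stand-in for any MLLM wrapper mapping
# (image, question) -> answer string; it is not an API from the paper.
# Passing None for a modality simulates withholding it.

from typing import Callable, Dict, Iterable, Optional

def exact_match(pred: str, gold: str) -> float:
    """Toy answer scorer; real benchmarks use their own metrics."""
    return float(pred.strip().lower() == gold.strip().lower())

def ablation_accuracies(
    model: Callable[[Optional[object], Optional[str]], str],
    dataset: Iterable[Dict],  # each item: {"image": ..., "question": str, "answer": str}
) -> Dict[str, float]:
    """Score the same model under full, image-only, and text-only conditions."""
    totals = {"full": 0.0, "image_only": 0.0, "text_only": 0.0}
    n = 0
    for ex in dataset:
        img, q, gold = ex["image"], ex["question"], ex["answer"]
        totals["full"] += exact_match(model(img, q), gold)
        totals["image_only"] += exact_match(model(img, None), gold)  # question withheld
        totals["text_only"] += exact_match(model(None, q), gold)     # image withheld
        n += 1
    return {k: v / max(n, 1) for k, v in totals.items()}
```

Comparing the three accuracies per benchmark shows how much of a model's score survives when one modality is withheld.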
Problem

Research questions and friction points this paper is trying to address.

Quantifying intra- and inter-modal dependencies in multi-modal learning benchmarks
Evaluating how reliance on vision, text, and their interaction varies across 23 VQA datasets
Identifying unintended biases in multi-modal benchmark design and evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantified intra- and inter-modality dependencies using MLLMs
Analyzed 23 VQA benchmarks across diverse domains
Provided a quantitative characterization of datasets for principled benchmark design (a toy summary metric is sketched below)
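Building on the ablation accuracies sketched above, one illustrative and deliberately simplified way to summarize a benchmark's dependency profile is to split its above-chance accuracy into image-only, text-only, and interaction shares; this particular split is an assumption for exposition, not the paper's exact formulation.

```python
# Illustrative decomposition of benchmark performance into image-only,
# text-only, and interaction shares, built on the ablation accuracies above.
# This split is an assumption for exposition, not the paper's metric.

from typing import Dict

def dependency_profile(acc: Dict[str, float], chance: float = 0.0) -> Dict[str, float]:
    """Split accuracy above chance into intra-modal and interaction components."""
    image_dep = max(acc["image_only"] - chance, 0.0)
    text_dep = max(acc["text_only"] - chance, 0.0)
    interaction = max(acc["full"] - max(acc["image_only"], acc["text_only"]), 0.0)
    total = image_dep + text_dep + interaction
    if total == 0.0:
        return {"image_share": 0.0, "text_share": 0.0, "interaction_share": 0.0}
    return {
        "image_share": image_dep / total,
        "text_share": text_dep / total,
        "interaction_share": interaction / total,
    }
```

Under this view, a benchmark dominated by the interaction share demands genuine joint reasoning, while a large image-only share flags exactly the kind of visual shortcut the paper reports.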