🤖 AI Summary
Current multimodal fusion evaluation is hindered by benchmarks that are small in scale, narrow in domain, task-specific, and inconsistently standardized, which yields poor model generalizability and results that cannot be compared across studies. To address this, we propose MMBench, the first large-scale, domain-adaptive multimodal fusion benchmark, integrating over 30 datasets, 15 modalities, and 20 predictive tasks across critical domains including healthcare, remote sensing, and industrial inspection. We design a unified cross-domain evaluation framework and an open-source automated pipeline supporting early-, late-, and hybrid-fusion paradigms (sketched below). The framework incorporates standardized preprocessing, cross-modal alignment, and domain-adaptation mechanisms. Extensive experiments establish new state-of-the-art baselines on multiple tasks, significantly improving generalizability and reproducibility. MMBench provides a rigorous, open, and extensible evaluation infrastructure for advancing multimodal fusion research.
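To make the three fusion paradigms concrete, here is a minimal PyTorch sketch of early, late, and hybrid fusion over vector-valued modality features. This is illustrative only and is not MMBench's implementation; the module names, hidden sizes, and the logit-averaging rule for late fusion are our assumptions.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate per-modality features first, then encode jointly."""
    def __init__(self, dims, hidden=256, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sum(dims), hidden), nn.ReLU(), nn.Linear(hidden, num_classes)
        )

    def forward(self, inputs):  # inputs: list of (B, dims[i]) tensors
        return self.net(torch.cat(inputs, dim=-1))

class LateFusion(nn.Module):
    """Encode each modality separately, then average the per-modality logits."""
    def __init__(self, dims, hidden=256, num_classes=10):
        super().__init__()
        self.encoders = nn.ModuleList([
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, num_classes))
            for d in dims
        ])

    def forward(self, inputs):
        return torch.stack([enc(x) for enc, x in zip(self.encoders, inputs)]).mean(dim=0)

class HybridFusion(nn.Module):
    """Encode separately, fuse intermediate features, then classify jointly."""
    def __init__(self, dims, hidden=256, num_classes=10):
        super().__init__()
        self.encoders = nn.ModuleList([
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in dims
        ])
        self.head = nn.Linear(hidden * len(dims), num_classes)

    def forward(self, inputs):
        feats = [enc(x) for enc, x in zip(self.encoders, inputs)]
        return self.head(torch.cat(feats, dim=-1))

# Toy usage: two modalities with feature dims 512 and 128, batch of 4.
x = [torch.randn(4, 512), torch.randn(4, 128)]
for M in (EarlyFusion, LateFusion, HybridFusion):
    print(M.__name__, M(dims=[512, 128])(x).shape)  # torch.Size([4, 10])
```

The practical trade-off the benchmark probes: early fusion can exploit low-level cross-modal correlations but is sensitive to missing modalities, late fusion is robust but cannot model feature-level interactions, and hybrid fusion sits between the two.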
📝 Abstract
Although multimodal fusion has made significant progress, its advancement is severely hindered by the lack of adequate evaluation benchmarks. Current fusion methods are typically evaluated on a small selection of public datasets, a limited scope that fails to capture the complexity and diversity of real-world scenarios and can lead to biased evaluations. This poses a twofold challenge. On the one hand, models may overfit to the biases of specific datasets, limiting their generalization to broader practical applications. On the other hand, the absence of a unified evaluation standard makes fair and objective comparison between fusion methods difficult. Consequently, a truly universal, high-performance fusion model has yet to emerge. To address these challenges, we develop a large-scale, domain-adaptive benchmark for multimodal evaluation that integrates over 30 datasets, encompassing 15 modalities and 20 predictive tasks across key application domains. To complement it, we release an open-source, unified, and automated evaluation pipeline that includes standardized implementations of state-of-the-art models and diverse fusion paradigms. Using this platform, we conduct large-scale experiments and establish new performance baselines across multiple tasks. This work provides the academic community with a platform for rigorous and reproducible assessment of multimodal models, aiming to advance the field of multimodal artificial intelligence.
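The abstract does not describe the pipeline's interface, so the following is a purely hypothetical Python sketch of the pattern a unified, automated evaluation pipeline typically follows: a registry of task specifications (name, domain, modalities, metric) and a single evaluation loop that works for any model. Every identifier here (`TaskSpec`, `REGISTRY`, `evaluate`, the `"chest-xray-dx"` task) is invented for illustration and does not reflect the actual released code.

```python
# Hypothetical sketch of a unified evaluation loop; the real API may differ.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class TaskSpec:
    name: str
    domain: str            # e.g. "healthcare", "remote-sensing"
    modalities: List[str]  # e.g. ["image", "text"]
    metric: Callable       # maps (predictions, labels) -> float

def accuracy(preds, labels):
    # Fraction of predictions matching labels.
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Registry of benchmark tasks; this entry is an invented example.
REGISTRY: Dict[str, TaskSpec] = {
    "chest-xray-dx": TaskSpec("chest-xray-dx", "healthcare", ["image", "text"], accuracy),
}

def evaluate(model: Callable, task: TaskSpec, batches) -> float:
    # `model` maps a dict of per-modality inputs to a list of predictions;
    # `batches` yields (inputs, labels) pairs with modalities already aligned.
    preds, labels = [], []
    for inputs, y in batches:
        preds.extend(model(inputs))
        labels.extend(y)
    return task.metric(preds, labels)

# Toy run with a dummy model that always predicts class 0.
dummy = lambda inputs: [0] * len(inputs["image"])
batches = [({"image": ["img_a", "img_b"], "text": ["t_a", "t_b"]}, [0, 1])]
print(evaluate(dummy, REGISTRY["chest-xray-dx"], batches))  # 0.5
```

The key property this pattern provides is the one the abstract claims: because every task is reduced to the same (inputs, labels, metric) contract, any fusion model can be compared against any other on identical data and scoring code.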