Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination

📅 2024-11-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Data contamination—particularly undetectable cross-modal leakage between text and images—plagues multimodal large language model (MLLM) training, inflating benchmark performance and undermining evaluation reliability. To address this, we propose MM-Detect, the first contamination detection framework tailored for MLLMs. It uniquely distinguishes contamination sources across two stages: LLM pretraining and MLLM fine-tuning. Our method introduces a cross-modal sensitivity detection mechanism, leveraging multimodal embedding alignment, cross-modal similarity distillation, and stage-wise attribution analysis to quantify contamination’s impact on performance. Using a controlled contamination injection experimental paradigm, MM-Detect successfully identifies significant training-set leakage in multiple state-of-the-art MLLMs on benchmarks including MMBench and OCRBench; up to 37% of certain models’ performance gains are attributable to contamination. MM-Detect thus provides a critical tool and empirical foundation for fair, rigorous MLLM evaluation.
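The summary above describes detecting contamination by injecting it in a controlled way and measuring the model's sensitivity to perturbed benchmark instances. The paper's exact mechanism is not reproduced here; the following is a minimal sketch of one perturbation-based probe under assumed conventions: a multiple-choice benchmark stored as dicts and a hypothetical `model(prompt, options)` callable. A model that memorized the benchmark's original surface form loses accuracy when option order is shuffled, while a model that genuinely solves the task does not.

```python
import random

def accuracy(model, questions):
    """Fraction of questions the model answers correctly."""
    return sum(model(q["prompt"], q["options"]) == q["answer"]
               for q in questions) / len(questions)

def shuffle_options(question, rng):
    """Return a copy of the question with its answer options permuted."""
    opts = question["options"][:]
    rng.shuffle(opts)
    return {**question, "options": opts}

def contamination_gap(model, questions, seed=0):
    """Accuracy drop when option order is perturbed.

    A large positive gap suggests the model memorized the benchmark's
    original surface form (likely training-set leakage) rather than
    solving the underlying task.
    """
    rng = random.Random(seed)
    perturbed = [shuffle_options(q, rng) for q in questions]
    return accuracy(model, questions) - accuracy(model, perturbed)
```

For example, a toy "memorizer" that always returns the first option scores perfectly on the original ordering but near chance after shuffling, yielding a large gap, whereas a model that returns the true answer regardless of ordering yields a gap of zero.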

📝 Abstract
The rapid progression of multimodal large language models (MLLMs) has demonstrated superior performance on various multimodal benchmarks. However, the issue of data contamination during training creates challenges in performance evaluation and comparison. While numerous methods exist for detecting dataset contamination in large language models (LLMs), they are less effective for MLLMs due to their various modalities and multiple training phases. In this study, we introduce a multimodal data contamination detection framework, MM-Detect, designed for MLLMs. Our experimental results indicate that MM-Detect is sensitive to varying degrees of contamination and can highlight significant performance improvements due to leakage of the training set of multimodal benchmarks. Furthermore, we explore the possibility of contamination originating from the pre-training phase of LLMs used by MLLMs and the fine-tuning phase of MLLMs, offering new insights into the stages at which contamination may be introduced.
Problem

Research questions and friction points this paper is trying to address.

Detecting multimodal data contamination effectively
Identifying leakage in multimodal benchmark training sets
Exploring contamination origins in MLLMs training phases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal data contamination detection
MM-Detect framework for MLLMs
Identifies contamination in training phases