Corrupted but Not Broken: Rethinking the Impact of Corrupted Data in Visual Instruction Tuning

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how corrupted data—hallucinated content, erroneous responses, and low-quality OCR—degrades multimodal large language models (MLLMs) during visual instruction tuning (VIT). The authors find that corruption-induced degradation is largely superficial, concentrated mainly in output-layer parameters; freezing lower-layer parameters, or fine-tuning with as little as 1% clean data, restores over 95% of the original performance. Building on this insight, they propose a self-verifying data-cleaning framework that needs no external labels: it identifies corrupted samples via the model's own parameter plasticity, then combines self-supervised confidence estimation with lightweight corruption-aware post-training in a two-stage robust training paradigm. The method significantly outperforms existing approaches across multiple VIT benchmarks and enables end-to-end automatic data cleaning.

📝 Abstract
Visual Instruction Tuning (VIT) enhances Multimodal Large Language Models (MLLMs) but is hindered by corrupted datasets containing hallucinated content, incorrect responses, and poor OCR quality. While prior works focus on dataset refinement through high-quality data collection or rule-based filtering, they are costly or limited to specific types of corruption. To deeply understand how corrupted data affects MLLMs, in this paper we systematically investigate this issue and find that while corrupted data degrades the performance of MLLMs, its effects are largely superficial, in that the performance of MLLMs can be largely restored by either disabling a small subset of parameters or post-training with a small amount of clean data. Additionally, corrupted MLLMs exhibit an improved ability to distinguish clean samples from corrupted ones, enabling dataset cleaning without external help. Based on these insights, we propose a corruption-robust training paradigm combining self-validation and post-training, which significantly outperforms existing corruption mitigation strategies.
Problem

Research questions and friction points this paper is trying to address.

Impact of corrupted data on MLLMs
Restoration methods for corrupted MLLMs
Corruption-robust training paradigm
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-validation enhances corruption robustness
Post-training restores MLLM performance
Small clean data mitigates corruption impact
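The two-stage idea behind these bullets—self-validation to filter corrupted samples, then lightweight post-training on the retained clean subset with most parameters frozen—can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `self_validate`, `post_train`, the per-sample losses, the filtering threshold, and the 99% freezing fraction are all hypothetical stand-ins for the paper's actual procedure.

```python
def self_validate(samples, losses, threshold):
    """Stage 1 (sketch): keep samples the (corrupted) model already fits well.

    The paper observes that corrupted MLLMs get better at telling clean
    samples from corrupted ones; here that is approximated by a simple
    loss threshold (an assumption, not the paper's exact criterion).
    """
    return [s for s, loss in zip(samples, losses) if loss <= threshold]


def post_train(model_params, clean_subset, frozen_frac=0.99):
    """Stage 2 (sketch): update only the last (1 - frozen_frac) of parameters.

    Mirrors the finding that a small amount of clean data plus a small
    trainable subset suffices to restore performance. The +0.01-per-sample
    "update" is a toy stand-in for an actual gradient step.
    """
    cutoff = int(len(model_params) * frozen_frac)
    frozen = model_params[:cutoff]                       # untouched parameters
    updated = [p + 0.01 * len(clean_subset) for p in model_params[cutoff:]]
    return frozen + updated
```

For example, filtering five samples with synthetic losses `[0.1, 2.0, 0.3, 5.0, 0.2]` at threshold 1.0 keeps the three low-loss ones, and `post_train` then leaves 99% of a toy parameter list untouched while nudging only the tail.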
Yunhao Gou
Southern University of Science and Technology, The Hong Kong University of Science and Technology
Hansi Yang
Hong Kong University of Science and Technology
meta-learning, few-shot learning, AutoML
Zhili Liu
Beike
SLAM, DL, HPC, Computer Graphics
Kai Chen
The Hong Kong University of Science and Technology
Yihan Zeng
Huawei Noah’s Ark Lab
Lanqing Hong
Huawei Noah’s Ark Lab
Zhenguo Li
Huawei Noah's Ark Lab, Columbia, CUHK, PKU
machine learning, generative AI, AI for mathematics
Qun Liu
Huawei Noah’s Ark Lab
James T. Kwok
Professor of Computer Science and Engineering, Hong Kong University of Science and Technology
Machine learning
Yu Zhang
Southern University of Science and Technology