LEMoN: Label Error Detection using Multimodal Neighbors

πŸ“… 2024-07-10
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Web-crawled image-text pairs frequently suffer from erroneous caption annotations, degrading downstream multimodal learning. Method: This paper proposes a novel automatic noise detection method based on multimodal neighborhood consistency, leveraging the k-nearest neighbor (k-NN) structure in the shared latent space of contrastive pre-trained models (e.g., CLIP). Unlike conventional filtering approaches relying solely on image–text similarity, our method quantifies annotation errors by statistically measuring cross-modal neighbor inconsistency. Contribution/Results: Theoretical analysis and empirical evaluation demonstrate the detrimental impact of noisy labels on downstream tasks. Our method achieves significant improvements over state-of-the-art baselines across multiple benchmark datasets. Models trained on data cleaned by our approach consistently outperform those trained on raw or alternatively filtered data in both image classification and image captioning, validating its effectiveness and generalizability.
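The neighborhood-consistency idea described above can be sketched in a few lines: embed images and captions in a shared space, find each image's nearest neighbor images, and flag pairs whose caption disagrees with the captions of those visually similar neighbors. The sketch below is a minimal illustration of that general idea, not the paper's exact LEMoN score; it uses random stand-in vectors where a real pipeline would use CLIP image/text features, and the scoring function name is hypothetical.

```python
import numpy as np

def knn_inconsistency_scores(img_emb, txt_emb, k=3):
    """Score each image-caption pair by cross-modal neighbor disagreement.

    Sketch of the neighborhood-consistency idea: for each image, find its
    k nearest neighbor images in embedding space, then measure how similar
    the *captions* of those neighbors are to this pair's caption. A caption
    far from the captions of visually similar images gets a high score,
    marking the pair as likely mislabeled.
    """
    # Normalize so dot products are cosine similarities (CLIP-style space).
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    n = len(img)
    img_sim = img @ img.T                 # image-image similarities
    np.fill_diagonal(img_sim, -np.inf)    # exclude each point from its own neighbors

    scores = np.empty(n)
    for i in range(n):
        nbrs = np.argsort(img_sim[i])[-k:]   # indices of k nearest images
        cap_sim = txt[nbrs] @ txt[i]         # caption agreement with neighbors
        scores[i] = 1.0 - cap_sim.mean()     # high score = suspicious pair
    return scores

# Toy demo with stand-in embeddings: two clusters of matched image/caption
# pairs, with one caption (index 3) swapped to the wrong cluster.
rng = np.random.default_rng(0)
c0, c1 = rng.normal(size=16), rng.normal(size=16)
centers = np.array([c0] * 4 + [c1] * 4)
img_emb = centers + 0.1 * rng.normal(size=(8, 16))
txt_emb = centers + 0.1 * rng.normal(size=(8, 16))
txt_emb[3] = c1 + 0.1 * rng.normal(size=16)  # corrupt one caption

scores = knn_inconsistency_scores(img_emb, txt_emb, k=3)
print(scores.argmax())  # the corrupted pair should rank highest
```

In practice one would sort by this score and drop (or re-caption) the top fraction of pairs before downstream training; the paper additionally reports that models trained on the filtered data improve on classification and captioning.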

πŸ“ Abstract
Large repositories of image-caption pairs are essential for the development of vision-language models. However, these datasets are often extracted from noisy data scraped from the web, and contain many mislabeled examples. In order to improve the reliability of downstream models, it is important to identify and filter images with incorrect captions. However, beyond filtering based on image-caption embedding similarity, no prior works have proposed other methods to filter noisy multimodal data, or concretely assessed the impact of noisy captioning data on downstream training. In this work, we propose LEMoN, a method to automatically identify label errors in multimodal datasets. Our method leverages the multimodal neighborhood of image-caption pairs in the latent space of contrastively pretrained multimodal models. We find that our method outperforms the baselines in label error identification, and that training on datasets filtered using our method improves downstream classification and captioning performance.
Problem

Research questions and friction points this paper is trying to address.

Detects mislabeled image-caption pairs in noisy datasets
Improves reliability of vision-language model training data
Enhances downstream captioning performance by filtering errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses multimodal neighborhood analysis
Leverages contrastively pretrained models
Improves label error detection accuracy
πŸ‘₯ Authors
Haoran Zhang, Massachusetts Institute of Technology
Aparna Balagopalan, Massachusetts Institute of Technology
Nassim Oufattole, PhD Student, Massachusetts Institute of Technology (Data Science, Machine Learning, Statistics)
Hyewon Jeong, Massachusetts Institute of Technology
Yan Wu, Massachusetts Institute of Technology
Jiacheng Zhu, MIT (Machine Learning, Foundation Models, Optimal Transport, Bayesian Modeling)
Marzyeh Ghassemi, Massachusetts Institute of Technology