AI Summary
Web-crawled image-text pairs frequently suffer from erroneous caption annotations, which degrade downstream multimodal learning. Method: This paper proposes an automatic noise-detection method based on multimodal neighborhood consistency, leveraging the k-nearest-neighbor (k-NN) structure in the shared latent space of contrastively pretrained models (e.g., CLIP). Unlike conventional filtering approaches that rely solely on image-text similarity, the method quantifies annotation errors by statistically measuring cross-modal neighbor inconsistency. Contribution/Results: Theoretical analysis and empirical evaluation demonstrate the detrimental impact of noisy captions on downstream tasks. The method achieves significant improvements over state-of-the-art baselines across multiple benchmark datasets, and models trained on data cleaned by this approach consistently outperform those trained on raw or alternatively filtered data in both image classification and image captioning, validating its effectiveness and generalizability.
Abstract
Large repositories of image-caption pairs are essential for the development of vision-language models. However, these datasets are often extracted from noisy data scraped from the web and contain many mislabeled examples. To improve the reliability of downstream models, it is important to identify and filter images with incorrect captions. However, beyond filtering based on image-caption embedding similarity, no prior work has proposed other methods to filter noisy multimodal data, or concretely assessed the impact of noisy captioning data on downstream training. In this work, we propose LEMoN, a method to automatically identify label errors in multimodal datasets. Our method leverages the multimodal neighborhood of image-caption pairs in the latent space of contrastively pretrained multimodal models. We find that our method outperforms the baselines in label error identification, and that training on datasets filtered using our method improves downstream classification and captioning performance.
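The abstract above does not spell out the scoring rule, but the idea of using the multimodal neighborhood of an image-caption pair can be sketched as follows. This is a hypothetical, simplified illustration (not the exact LEMoN formulation): for each pair, find the k nearest neighbors of the image among all image embeddings, then check how similar the pair's caption is to those neighbors' captions. A caption that disagrees with the captions attached to visually similar images is flagged as a likely label error. The function name `knn_inconsistency_scores` and the choice of a mean-similarity score are assumptions made for this sketch.

```python
import numpy as np

def knn_inconsistency_scores(img_emb, txt_emb, k=3):
    """Hypothetical neighborhood-consistency noise score (illustrative,
    not the paper's exact method). For each image-caption pair i:
      1. Find the k nearest neighbors of image i among all images.
      2. Measure how similar caption i is to those neighbors' captions.
    Higher score = caption disagrees with visually similar images'
    captions = more likely to be a label error.
    """
    # L2-normalize so dot products are cosine similarities,
    # mirroring a CLIP-style shared embedding space.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    img_sim = img @ img.T                     # image-image similarities
    np.fill_diagonal(img_sim, -np.inf)        # exclude self-matches
    nbrs = np.argsort(-img_sim, axis=1)[:, :k]  # indices of k nearest images

    txt_sim = txt @ txt.T                     # caption-caption similarities
    # Mean similarity of caption i to its image-neighbors' captions,
    # inverted so that a higher score means a more suspicious pair.
    consistency = np.take_along_axis(txt_sim, nbrs, axis=1).mean(axis=1)
    return 1.0 - consistency
```

On a toy dataset with two visual clusters where one image carries a caption from the wrong cluster, the mislabeled pair receives the highest score, since its caption has near-zero similarity to the captions of its visual neighbors.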