AI Summary
Web-crawled image-text pairs frequently suffer from erroneous caption annotations, which degrade downstream multimodal learning. Method: This paper proposes an automatic noise-detection method based on multimodal neighborhood consistency, leveraging the k-nearest-neighbor (k-NN) structure in the shared latent space of contrastively pretrained models (e.g., CLIP). Unlike conventional filtering approaches that rely solely on image-text similarity, the method quantifies annotation errors by statistically measuring cross-modal neighbor inconsistency. Contribution/Results: Theoretical analysis and empirical evaluation demonstrate the detrimental impact of noisy captions on downstream tasks. The method achieves significant improvements over state-of-the-art baselines across multiple benchmark datasets, and models trained on data cleaned by this approach consistently outperform those trained on raw or alternatively filtered data in both image classification and image captioning, validating its effectiveness and generalizability.
Abstract
Large repositories of image-caption pairs are essential for the development of vision-language models. However, these datasets are often extracted from noisy data scraped from the web and contain many mislabeled examples. To improve the reliability of downstream models, it is important to identify and filter images with incorrect captions. However, beyond filtering based on image-caption embedding similarity, no prior work has proposed other methods to filter noisy multimodal data, or concretely assessed the impact of noisy captioning data on downstream training. In this work, we propose LEMoN, a method to automatically identify label errors in multimodal datasets. Our method leverages the multimodal neighborhood of image-caption pairs in the latent space of contrastively pretrained multimodal models. We find that our method outperforms the baselines in label error identification, and that training on datasets filtered using our method improves downstream classification and captioning performance.
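The abstract above does not spell out the scoring rule, but the idea of using the multimodal neighborhood of an image-caption pair can be sketched as follows. This is a hypothetical, simplified illustration (not the exact LEMoN formulation): for each pair, find the k nearest neighbors of the image among all image embeddings, then check how similar the pair's caption is to those neighbors' captions. A caption that disagrees with the captions attached to visually similar images is flagged as a likely label error. The function name `knn_inconsistency_scores` and the choice of a mean-similarity score are assumptions made for this sketch.

```python
import numpy as np

def knn_inconsistency_scores(img_emb, txt_emb, k=3):
    """Hypothetical neighborhood-consistency noise score (illustrative,
    not the paper's exact method). For each image-caption pair i:
      1. Find the k nearest neighbors of image i among all images.
      2. Measure how similar caption i is to those neighbors' captions.
    Higher score = caption disagrees with visually similar images'
    captions = more likely to be a label error.
    """
    # L2-normalize so dot products are cosine similarities,
    # mirroring a CLIP-style shared embedding space.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    img_sim = img @ img.T                     # image-image similarities
    np.fill_diagonal(img_sim, -np.inf)        # exclude self-matches
    nbrs = np.argsort(-img_sim, axis=1)[:, :k]  # indices of k nearest images

    txt_sim = txt @ txt.T                     # caption-caption similarities
    # Mean similarity of caption i to its image-neighbors' captions,
    # inverted so that a higher score means a more suspicious pair.
    consistency = np.take_along_axis(txt_sim, nbrs, axis=1).mean(axis=1)
    return 1.0 - consistency
```

On a toy dataset with two visual clusters where one image carries a caption from the wrong cluster, the mislabeled pair receives the highest score, since its caption has near-zero similarity to the captions of its visual neighbors.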