🤖 AI Summary
This work systematically exposes data leakage in visual benchmark datasets, where training images inadvertently contaminate evaluation sets and lead to inflated, unreliable estimates of model performance. We propose a fine-grained classification framework that characterizes leakage along three orthogonal dimensions: modality (pixel-level, feature-level, or semantic-level), scope (instance-, class-, or distribution-level), and severity. Leveraging cross-dataset image retrieval and quantitative similarity analysis, we empirically detect leakage across 12 major vision benchmarks, including ImageNet and COCO. Results reveal pervasive leakage in all evaluated datasets; in some cases, test samples exhibit similarity scores of up to 0.92 with training images, artificially inflating state-of-the-art model accuracy by 3.7% to 11.2%. To our knowledge, this is the first study to establish a reproducible, end-to-end leakage diagnosis pipeline, providing both theoretical foundations and practical tools for trustworthy visual evaluation.
📝 Abstract
We analyze data leakage in visual datasets. Data leakage refers to images in evaluation benchmarks that have been seen during training, compromising fair model evaluation. Because large-scale training sets are often scraped from the internet, where many computer vision benchmarks are publicly available, we focus on identifying and studying this phenomenon. We characterize visual leakage into distinct types according to its modality, coverage, and degree. By applying image retrieval techniques, we show that all the analyzed datasets present some form of leakage, and that every type of leakage, from severe instances to more subtle cases, compromises the reliability of model evaluation in downstream tasks.
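The retrieval-based detection described above can be sketched as a nearest-neighbor search in an embedding space: each test image is compared against all training images, and a test sample whose best match exceeds a similarity threshold is flagged as potentially leaked. The sketch below is a minimal, hypothetical illustration, not the paper's actual pipeline; it assumes image embeddings have already been extracted by some feature model, and the function name `flag_leaked_samples` and the threshold value are illustrative (the 0.92 score mentioned above is a reported finding, not a prescribed cutoff).

```python
import numpy as np

def flag_leaked_samples(test_emb, train_emb, threshold=0.9):
    """Flag test images whose nearest training image exceeds a
    cosine-similarity threshold. Inputs are (n, d) embedding matrices."""
    # L2-normalize rows so dot products equal cosine similarities
    test_n = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    train_n = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    sims = test_n @ train_n.T          # (n_test, n_train) similarity matrix
    best = sims.max(axis=1)            # best training match per test image
    return np.where(best >= threshold)[0], best

# Toy example with synthetic embeddings: test image 0 is a
# near-duplicate of a training image, the rest are unrelated.
rng = np.random.default_rng(0)
train = rng.normal(size=(4, 8))
test = np.vstack([train[2] + 0.01 * rng.normal(size=8),
                  rng.normal(size=(2, 8))])
leaked, scores = flag_leaked_samples(test, train, threshold=0.95)
print(leaked)  # the near-duplicate (index 0) is flagged as leaked
```

A brute-force similarity matrix is fine for small datasets; at benchmark scale an approximate nearest-neighbor index would replace the dense matrix multiply.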