Data Leakage in Visual Datasets

📅 2025-08-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work systematically exposes data leakage in visual benchmark datasets, where training images inadvertently contaminate evaluation sets, leading to inflated and unreliable model performance estimates. We propose a fine-grained classification framework that characterizes leakage along three orthogonal dimensions: modality (pixel-level, feature-level, semantic-level), scope (instance-, class-, or distribution-level), and severity. Leveraging cross-dataset image retrieval and quantitative similarity analysis, we empirically detect leakage across 12 major vision benchmarks, including ImageNet and COCO. Results reveal pervasive leakage in all evaluated datasets; in some cases, test samples exhibit similarity scores up to 0.92 with training images, artificially inflating state-of-the-art model accuracy by 3.7–11.2%. To our knowledge, this is the first study to establish a reproducible, end-to-end leakage diagnosis pipeline, providing both theoretical foundations and practical tools for trustworthy visual evaluation.

📝 Abstract
We analyze data leakage in visual datasets. Data leakage refers to images in evaluation benchmarks that have been seen during training, compromising fair model evaluation. Given that large-scale datasets are often sourced from the internet, where many computer vision benchmarks are publicly available, we focus on identifying and studying this phenomenon. We categorize visual leakage into different types according to its modality, coverage, and degree. By applying image retrieval techniques, we unequivocally show that all the analyzed datasets present some form of leakage, and that all types of leakage, from severe instances to more subtle cases, compromise the reliability of model evaluation in downstream tasks.
Problem

Research questions and friction points this paper is trying to address.

Identifying data leakage in visual evaluation datasets
Characterizing leakage types by modality, coverage, and degree
Assessing how leakage compromises model evaluation reliability
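
One way to make the last point concrete is to compare a model's accuracy on the full evaluation set with its accuracy on the subset that was not flagged as leaked. The sketch below is only an illustration under assumptions: the helper names `accuracy` and `leakage_gap`, the per-sample flags, and the toy values are hypothetical and are not taken from the paper.

```python
def accuracy(correct_flags):
    """Fraction of True entries; NaN for an empty list."""
    if not correct_flags:
        return float("nan")
    return sum(correct_flags) / len(correct_flags)


def leakage_gap(correct, leaked):
    """Compare accuracy on the full test set, the clean subset, and the leaked subset.

    correct[i] -- True if the model's prediction for test sample i is right
    leaked[i]  -- True if test sample i was flagged as seen during training
    Both lists are hypothetical inputs produced by some earlier step.
    """
    if len(correct) != len(leaked):
        raise ValueError("correct and leaked must have the same length")
    clean = [c for c, is_leaked in zip(correct, leaked) if not is_leaked]
    leaked_only = [c for c, is_leaked in zip(correct, leaked) if is_leaked]
    return {
        "full": accuracy(correct),
        "clean": accuracy(clean),
        "leaked": accuracy(leaked_only),
    }


# Toy values for illustration only (not results from the paper).
correct = [True, True, True, False, True, False]
leaked = [True, True, False, False, False, False]
report = leakage_gap(correct, leaked)
print(report)
```

A gap between the "full" and "clean" figures suggests that the headline number is partly explained by leaked samples. The size of the gap depends on how leakage was flagged in the first place.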
Innovation

Methods, ideas, or system contributions that make the work stand out.

Image retrieval techniques identify data leakage
A taxonomy characterizes leakage by modality, coverage, and degree
All analyzed datasets show leakage compromising evaluation
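
A retrieval-based leakage check can be sketched as a nearest-neighbour search over image embeddings: each evaluation image is compared against the training images and flagged when its best match exceeds a similarity threshold. This is a minimal sketch, not the authors' pipeline. The embedding vectors, the `find_leaked` helper, the use of cosine similarity, and the 0.9 threshold are all assumptions made for illustration, and a real system would use an approximate nearest-neighbour index rather than a brute-force loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_leaked(test_embeddings, train_embeddings, threshold=0.9):
    """Return (test_index, train_index, similarity) for each test item whose
    nearest training item has similarity >= threshold (an assumed cutoff)."""
    flagged = []
    for i, test_vec in enumerate(test_embeddings):
        best_j, best_sim = -1, -1.0
        # Brute-force search for the most similar training embedding.
        for j, train_vec in enumerate(train_embeddings):
            sim = cosine_similarity(test_vec, train_vec)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_sim >= threshold:
            flagged.append((i, best_j, best_sim))
    return flagged


# Toy embeddings for illustration only; real ones would come from an image encoder.
train = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
test = [[0.99, 0.05, 0.0], [0.0, 0.0, 1.0]]
flagged = find_leaked(test, train, threshold=0.9)
print(flagged)
```

Here the first test vector is nearly parallel to the first training vector and is flagged, while the second is orthogonal to both and is not. The threshold trades off missed subtle leakage against false positives, so it would need tuning for each dataset and encoder.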