🤖 AI Summary
This work systematically exposes data leakage in visual benchmark datasets, where training images inadvertently contaminate evaluation sets and lead to inflated, unreliable estimates of model performance. We propose a fine-grained classification framework that characterizes leakage along three orthogonal dimensions: modality (pixel-level, feature-level, or semantic-level), scope (instance-, class-, or distribution-level), and severity. Leveraging cross-dataset image retrieval and quantitative similarity analysis, we empirically detect leakage across 12 major vision benchmarks, including ImageNet and COCO. Results reveal pervasive leakage in all evaluated datasets; in some cases, test samples exhibit similarity scores of up to 0.92 with training images, artificially inflating state-of-the-art model accuracy by 3.7% to 11.2%. To our knowledge, this is the first study to establish a reproducible, end-to-end leakage diagnosis pipeline, providing both theoretical foundations and practical tools for trustworthy visual evaluation.
📝 Abstract
We analyze data leakage in visual datasets. Data leakage refers to images in evaluation benchmarks that have been seen during training, compromising fair model evaluation. Because large-scale training sets are often scraped from the internet, where many computer vision benchmarks are publicly available, we focus on identifying and studying this phenomenon. We characterize visual leakage into distinct types according to its modality, coverage, and degree. By applying image retrieval techniques, we show that all the analyzed datasets present some form of leakage, and that every type of leakage, from severe instances to more subtle cases, compromises the reliability of model evaluation in downstream tasks.
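The retrieval-based detection described above can be sketched as a nearest-neighbor search in an embedding space: each test image is compared against all training images, and a test sample whose best match exceeds a similarity threshold is flagged as potentially leaked. The sketch below is a minimal, hypothetical illustration, not the paper's actual pipeline; it assumes image embeddings have already been extracted by some feature model, and the function name `flag_leaked_samples` and the threshold value are illustrative (the 0.92 score mentioned above is a reported finding, not a prescribed cutoff).

```python
import numpy as np

def flag_leaked_samples(test_emb, train_emb, threshold=0.9):
    """Flag test images whose nearest training image exceeds a
    cosine-similarity threshold. Inputs are (n, d) embedding matrices."""
    # L2-normalize rows so dot products equal cosine similarities
    test_n = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    train_n = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    sims = test_n @ train_n.T          # (n_test, n_train) similarity matrix
    best = sims.max(axis=1)            # best training match per test image
    return np.where(best >= threshold)[0], best

# Toy example with synthetic embeddings: test image 0 is a
# near-duplicate of a training image, the rest are unrelated.
rng = np.random.default_rng(0)
train = rng.normal(size=(4, 8))
test = np.vstack([train[2] + 0.01 * rng.normal(size=8),
                  rng.normal(size=(2, 8))])
leaked, scores = flag_leaked_samples(test, train, threshold=0.95)
print(leaked)  # the near-duplicate (index 0) is flagged as leaked
```

A brute-force similarity matrix is fine for small datasets; at benchmark scale an approximate nearest-neighbor index would replace the dense matrix multiply.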