🤖 AI Summary
This work argues that supervised dataset classification overestimates the semantic shift between large-scale natural image datasets, because classifiers often exploit non-semantic artifacts such as resolution rather than genuine semantic content. To correct for this, the authors propose an unsupervised semantic clustering framework that uses features extracted from foundation vision models to assess inter-dataset semantic separability directly, without training on dataset labels. Through controlled experiments, they show that the high classification accuracy reported by conventional approaches stems primarily from non-semantic confounders. When applied to mainstream web-scale datasets, their method yields clustering accuracy near chance, which indicates that previously reported semantic discrepancies have been substantially exaggerated.
📝 Abstract
In computer vision, a prevailing method for quantifying dataset bias is to train a model to distinguish between datasets. High classification accuracy is then interpreted as evidence of meaningful semantic differences. This approach assumes that standard image augmentations successfully suppress low-level, non-semantic cues, and that any remaining performance must therefore reflect true semantic divergence. We demonstrate that this assumption is flawed for large-scale natural image collections. High classification accuracy is often driven by resolution-based artifacts: structural fingerprints that arise from native image resolution distributions and from interpolation effects during resizing. These artifacts form robust, dataset-specific signatures that persist despite conventional image corruptions. Through controlled experiments, we show that models achieve strong dataset classification even on non-semantic, procedurally generated images, which shows that they rely on superficial cues. To address this issue, we revisit the decades-old idea of dataset separability, but without supervised classification. Instead, we introduce an unsupervised approach that measures true semantic separability. Our framework assesses semantic similarity directly by clustering semantically rich features from foundation vision models, deliberately bypassing supervised classification on dataset labels. When applied to major web-scale datasets, the primary focus of this work, the high separability reported by supervised methods largely vanishes, with clustering accuracy dropping to near-chance levels. This reveals that conventional classification-based evaluation systematically and substantially overstates semantic bias.
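The clustering-based measurement described above can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm, the feature extractor, or the exact accuracy metric, so everything here is an assumption: plain k-means stands in for the clustering step, synthetic Gaussian vectors stand in for foundation-model features, and accuracy is computed under the best one-to-one mapping from clusters to datasets (found by brute force, which is only practical for a handful of datasets). All function names are hypothetical.

```python
import itertools
import random


def make_toy_features(n_per_dataset, k, dim, shift, seed=0):
    """Synthetic stand-in for foundation-model features (an assumption).

    Dataset d is drawn from a Gaussian centered at d * shift in every
    coordinate, so shift=0 means the datasets are indistinguishable.
    """
    rng = random.Random(seed)
    feats, labels = [], []
    for d in range(k):
        for _ in range(n_per_dataset):
            feats.append([rng.gauss(d * shift, 1.0) for _ in range(dim)])
            labels.append(d)
    return feats, labels


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster id per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for step in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        new_assign = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        if step > 0 and new_assign == assign:
            break  # converged: assignments no longer change
        assign = new_assign
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def clustering_accuracy(cluster_ids, dataset_ids, k):
    """Accuracy under the best one-to-one cluster-to-dataset mapping.

    Dataset labels are used only here, for evaluation, never for fitting.
    """
    counts = [[0] * k for _ in range(k)]
    for c, d in zip(cluster_ids, dataset_ids):
        counts[c][d] += 1
    best = max(
        sum(counts[c][perm[c]] for c in range(k))
        for perm in itertools.permutations(range(k))
    )
    return best / len(dataset_ids)


if __name__ == "__main__":
    for shift in (5.0, 0.0):
        feats, labels = make_toy_features(200, 2, 8, shift=shift, seed=1)
        acc = clustering_accuracy(kmeans(feats, 2), labels, 2)
        print(f"shift={shift}: clustering accuracy = {acc:.3f}")
```

On this toy data, the well-separated case (`shift=5.0`) should score close to 1.0, while the overlapping case (`shift=0.0`) should stay near 0.5, which is chance level for two datasets. The second case mirrors the near-chance outcome the abstract reports for web-scale datasets. For more than a few datasets, the brute-force permutation search would need to be replaced by an assignment solver.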