🤖 AI Summary
This study addresses implicit domain shifts in unlabeled datasets caused by variations in devices or acquisition conditions. The authors propose a method that combines detection of localized density anomalies in high-dimensional feature spaces with interpretable subspace attribution. When an anomaly is detected, the approach identifies the feature subspace in which it is most pronounced and extracts, from the two datasets, subsets of samples with no detectable residual distributional difference, which compensates for the shift without labels. The framework covers both broad and localized shifts. On controlled 20-dimensional benchmarks with known ground truth, the method recovers the shifts together with their supporting feature subspaces; on healthy electrocardiogram (ECG) recordings represented by 782 features, it detects device-induced shifts and identifies the ECG features associated with the acquisition contrast.
📝 Abstract
We developed a tool for detecting domain shifts, namely subtle differences in the probability distributions of datasets. We identify these shifts using an algorithm designed to detect localised density anomalies in high-dimensional feature spaces. If an anomaly is present, we then identify the feature subspace in which the anomaly is most pronounced. This allows us to trace the domain shift to a small set of features, making the shift interpretable. Moreover, we provide a protocol for compensating for domain shifts by extracting, from two unlabelled datasets, subsets of samples with no detectable residual distributional difference. We validate the framework on controlled 20-dimensional benchmarks with known ground truth, recovering both broad and localised shifts together with their supporting feature subspaces. We then apply it to healthy electrocardiogram (ECG) recordings represented by 782 features. In age- and sex-matched cohort comparisons differing in measurement-device composition, the method detects device-induced shifts, extracts representative subsets enriched in the imbalanced device components, and identifies ECG features associated with the acquisition contrast. These results suggest that density-shift detection and subspace attribution provide a practical framework for uncovering hidden cohort biases before downstream modelling.
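
To make the idea of a localised density anomaly concrete, the sketch below uses a simple stand-in technique, not the authors' algorithm: a k-nearest-neighbour composition test. For every pooled sample it counts how many of its k neighbours come from the first dataset and converts the count to an approximate z-score under the no-shift hypothesis. Repeating the test on single features gives a crude form of subspace attribution. The function names, the choice of k, and the synthetic 20-dimensional data are all illustrative assumptions.

```python
import numpy as np

def knn_shift_scores(a, b, k=20):
    """Approximate z-score, per pooled sample, of the number of its
    k nearest neighbours that come from dataset `a` (hypothetical helper)."""
    pooled = np.vstack([a, b])
    from_a = np.concatenate([np.ones(len(a)), np.zeros(len(b))])
    # Pairwise squared Euclidean distances; fine for a few thousand samples.
    sq = np.sum(pooled ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (pooled @ pooled.T)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    nn = np.argpartition(d2, k, axis=1)[:, :k]  # indices of the k nearest
    counts = from_a[nn].sum(axis=1)
    # Without a shift, counts are roughly Binomial(k, p) with p = |a| / (|a| + |b|).
    p = len(a) / len(pooled)
    return (counts - k * p) / np.sqrt(k * p * (1.0 - p))

def shift_strength(a, b, k=20):
    """Mean squared z-score: close to 1 without a shift, larger with one."""
    return float(np.mean(knn_shift_scores(a, b, k) ** 2))

# Synthetic example (assumed data): two 20-dimensional samples that differ
# only in feature 3.
rng = np.random.default_rng(0)
a = rng.normal(size=(300, 20))
b = rng.normal(size=(300, 20))
b[:, 3] += 1.5

# Crude attribution: rerun the test on each feature separately.
per_feature = np.array(
    [shift_strength(a[:, [j]], b[:, [j]]) for j in range(a.shape[1])]
)
top_feature = int(np.argmax(per_feature))
print(top_feature, per_feature.round(2))
```

In a sketch like this, samples whose z-scores stay close to zero sit in regions where the two datasets are locally well mixed, so they would be natural candidates for a shift-compensated subset. The protocol described in the abstract may select subsets differently.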