Beyond Pooling: Matching for Robust Generalization under Data Heterogeneity

📅 2026-02-06
🤖 AI Summary
This work addresses the representation bias that conventional pooling introduces in cross-domain settings with strong heterogeneity and highly asymmetric data distributions, a bias that severely undermines zero-shot generalization. To mitigate it, the paper introduces, for the first time, a matching-based mechanism into representation learning: a framework that combines adaptive centroid selection with propensity score matching and uses doubly robust estimation to filter out confounding domains. The approach outperforms standard pooling and uniform subsampling under non-Gaussian, multimodal, and extremely asymmetric distributions. Both theoretical analysis and empirical evaluation support its advantage under asymmetric meta-distributions, with notable improvements in zero-shot medical anomaly detection.
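
For readers unfamiliar with the term, doubly robust estimation is usually illustrated with the augmented inverse-probability-weighted (AIPW) estimator from causal inference. The form below is that textbook estimator, shown only as background; it is not necessarily the estimator used in the paper. The symbols $T_i$ (inclusion indicator), $X_i$ (features), $Y_i$ (outcome), $\hat{e}$ (estimated propensity score), and $\hat{m}$ (estimated outcome model) are notation assumed here for illustration.

```latex
\hat{\mu}_{\mathrm{DR}}
  = \frac{1}{n} \sum_{i=1}^{n}
    \left[ \hat{m}(X_i)
         + \frac{T_i \, \bigl( Y_i - \hat{m}(X_i) \bigr)}{\hat{e}(X_i)} \right],
\qquad
\hat{e}(x) \approx \Pr(T = 1 \mid X = x), \quad
\hat{m}(x) \approx \mathbb{E}[\, Y \mid X = x,\; T = 1 \,].
```

Under the usual overlap and unconfoundedness assumptions, this estimator stays consistent if either the propensity model $\hat{e}$ or the outcome model $\hat{m}$ is correctly specified, which is the sense in which it is "doubly" robust.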

๐Ÿ“ Abstract
Pooling heterogeneous datasets across domains is a common strategy in representation learning, but naive pooling can amplify distributional asymmetries and yield biased estimators, especially in settings where zero-shot generalization is required. We propose a matching framework that selects samples relative to an adaptive centroid and iteratively refines the representation distribution. Double robustness and propensity score matching for the inclusion of data domains make matching more robust than naive pooling and uniform subsampling by filtering out the confounding domains, which are the main cause of heterogeneity. Theoretical and empirical analyses show that, unlike naive pooling or uniform subsampling, matching achieves better results under asymmetric meta-distributions, and these results extend to non-Gaussian and multimodal real-world settings. Most importantly, we show that these improvements translate to zero-shot medical anomaly detection, one of the most extreme forms of data heterogeneity and asymmetry. The code is available at https://github.com/AyushRoy2001/Beyond-Pooling.
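
The idea of selecting samples relative to an adaptive centroid and refining iteratively can be illustrated with a small toy sketch. This is not the authors' algorithm (see their repository for that): the helper name `match_to_centroid`, the Euclidean distance, the fixed `keep_frac` selection rule, and the two-cluster toy data are all assumptions made here for illustration. The sketch omits propensity scores and doubly robust estimation entirely.

```python
import numpy as np


def match_to_centroid(features, keep_frac=0.5, n_iters=20, tol=1e-8):
    """Hypothetical helper: keep the samples nearest an adaptive centroid.

    Starts from the naive pooled mean, keeps the `keep_frac` fraction of
    samples closest to the current centroid, recomputes the centroid from
    the kept samples, and repeats until the centroid stops moving.
    """
    features = np.asarray(features, dtype=float)
    n_keep = max(1, int(round(keep_frac * len(features))))
    centroid = features.mean(axis=0)  # initialise at the naive pooled mean
    for _ in range(max(1, n_iters)):
        dists = np.linalg.norm(features - centroid, axis=1)
        keep = np.argsort(dists)[:n_keep]           # indices of nearest samples
        new_centroid = features[keep].mean(axis=0)  # refine the centroid
        shift = np.linalg.norm(new_centroid - centroid)
        centroid = new_centroid
        if shift < tol:
            break
    return np.sort(keep), centroid


# Toy asymmetric pool (assumed data): 90 samples from a dominant domain
# near the origin and 10 samples from a distant "confounding" domain.
rng = np.random.default_rng(0)
majority = rng.normal(loc=0.0, scale=1.0, size=(90, 2))
confounder = rng.normal(loc=10.0, scale=1.0, size=(10, 2))
pooled = np.vstack([majority, confounder])

kept, matched_centroid = match_to_centroid(pooled, keep_frac=0.5)
print("pooled mean:     ", pooled.mean(axis=0))
print("matched centroid:", matched_centroid)
print("confounder samples kept:", int((kept >= 90).sum()))
```

In this toy setting the pooled mean is pulled toward the small distant cluster, while the matched centroid stays close to the dominant domain because the distant samples are never among the nearest half.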
Problem

Research questions and friction points this paper is trying to address.

data heterogeneity
zero-shot generalization
distributional asymmetry
confounding domains
robust representation learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

matching
data heterogeneity
zero-shot generalization
propensity score matching
robust representation learning