🤖 AI Summary
In multi-center medical data analysis, conventional ComBat-based harmonization methods are prone to data leakage because they rely on the true target labels, a problem that is most pronounced when class balance differs across sites. To address this, we propose PrettYharmonize, a harmonization method built around a “pretend-label” mechanism: it harmonizes data by pretending the target labels rather than accessing the true ones. We first validate the approach on controlled datasets designed to benchmark the utility of harmonization, then evaluate it on real-world multi-center MRI and clinical data. Experiments show that PrettYharmonize achieves performance comparable to leakage-prone baselines while avoiding data leakage, particularly in scenarios where site and target are dependent.
📝 Abstract
Machine learning (ML) models benefit from large datasets. Because collecting data in biomedical domains is costly and challenging, combining datasets has become common practice. However, datasets obtained under different conditions can exhibit undesired site-specific variability. Data harmonization methods aim to remove site-specific variance while retaining biologically relevant information. This study evaluates the effectiveness of widely used ComBat-based methods for harmonizing data in scenarios where the class balance differs across sites. We find that these methods are prone to data leakage. To overcome this problem, we propose a novel approach, PrettYharmonize, designed to harmonize data by pretending the target labels. We validate our approach using controlled datasets designed to benchmark the utility of harmonization. Finally, using real-world MRI and clinical data, we compare leakage-prone methods with PrettYharmonize and show that it achieves comparable performance while avoiding data leakage, particularly in site-target-dependence scenarios.
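To make the site-target-dependence problem concrete, the following is a minimal synthetic sketch, not the paper's method and not ComBat itself. It assumes a single feature, two sites with opposite class balance, and a simplified location-only adjustment (per-site mean removal) as a stand-in for harmonization. All function names, the effect sizes, and the sample counts are hypothetical choices made for illustration. It contrasts a site-only adjustment, which can remove part of the class signal, with an adjustment that preserves the label as a covariate, which recovers the signal but needs the true labels of every sample it harmonizes.

```python
import numpy as np


def class_gap(x, y):
    """Difference between the pooled class-1 mean and the pooled class-0 mean."""
    return x[y == 1].mean() - x[y == 0].mean()


def center_by_site(x, site):
    """Site-only adjustment: subtract each site's mean (no label information)."""
    out = x.astype(float).copy()
    for s in np.unique(site):
        mask = site == s
        out[mask] -= x[mask].mean()
    return out


def remove_site_keep_label(x, site, y):
    """Label-aware adjustment: fit x ~ intercept + site + label by least
    squares and subtract only the estimated site term. This requires the
    true label of every sample being harmonized."""
    design = np.column_stack(
        [np.ones_like(x), (site == 1).astype(float), y.astype(float)]
    )
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    return x - coef[1] * design[:, 1]


# Hypothetical data: 1000 samples per site, opposite class balance.
rng = np.random.default_rng(0)
n = 1000
site = np.repeat([0, 1], n)
y = np.concatenate(
    [np.repeat([1, 0], [900, 100]),   # site 0: 90% class 1
     np.repeat([1, 0], [100, 900])]   # site 1: 10% class 1
)
true_class_effect = 1.0               # assumed biological signal
site_shift = np.where(site == 1, 2.0, 0.0)  # assumed site effect
x = true_class_effect * y + site_shift + rng.normal(0.0, 0.5, size=2 * n)

gap_raw = class_gap(x, y)                                   # confounded by site
gap_naive = class_gap(center_by_site(x, site), y)           # signal partly removed
gap_label_aware = class_gap(remove_site_keep_label(x, site, y), y)

print(f"raw: {gap_raw:.2f}  site-only: {gap_naive:.2f}  "
      f"label-aware: {gap_label_aware:.2f}")
```

Under these assumptions, the raw class gap is distorted by the site shift, the site-only adjustment shrinks the gap well below the simulated effect of 1.0 (about 0.36 in the noise-free limit), and the label-aware adjustment recovers a gap close to 1.0. The last variant is the one that becomes leakage-prone in practice, because applying it to held-out data would require that data's target labels.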