Impact of Leakage on Data Harmonization in Machine Learning Pipelines in Class Imbalance Across Sites

📅 2024-10-25
🏛️ arXiv.org
🤖 AI Summary
In multi-center medical data analysis, widely used ComBat-based harmonization methods are prone to data leakage when class balance differs across sites, that is, when site and target are dependent. To address this, we propose PrettYharmonize, a harmonization method built around a "pretend-label" mechanism: data are harmonized by pretending the target labels, so the true labels of the data being harmonized are never accessed. We first validate the approach on controlled datasets designed to benchmark the utility of harmonization, then compare it with leakage-prone methods on real-world MRI and clinical data. PrettYharmonize achieves comparable performance while avoiding data leakage, particularly in site-target-dependence scenarios.

📝 Abstract
Machine learning (ML) models benefit from large datasets. Because collecting data in biomedical domains is costly and challenging, combining datasets has become common practice. However, datasets obtained under different conditions can contain undesired site-specific variability. Data harmonization methods aim to remove this site-specific variance while retaining biologically relevant information. This study evaluates the effectiveness of widely used ComBat-based methods for harmonizing data in scenarios where the class balance differs across sites. We find that these methods struggle with data leakage. To overcome this problem, we propose a novel approach, PrettYharmonize, designed to harmonize data by pretending the target labels. We validate our approach using controlled datasets designed to benchmark the utility of harmonization. Finally, using real-world MRI and clinical data, we compare leakage-prone methods with PrettYharmonize and show that it achieves comparable performance while avoiding data leakage, particularly in site-target-dependence scenarios.
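
To make the site-target-dependence problem concrete, here is a minimal toy sketch. It is not ComBat and not the authors' code: the one-feature additive model, the site offsets, and the helper names (`make_site`, `center_per_site`, `class_gap`) are all assumptions chosen for illustration. It shows that when class balance differs across sites, removing each site's mean without regard to the labels also removes part of the class signal.

```python
from collections import defaultdict

def make_site(site_offset, n_pos, n_neg, class_effect=1.0):
    """Toy site: feature = class effect (class 1 only) + additive site offset."""
    return [(class_effect + site_offset, 1)] * n_pos + [(site_offset, 0)] * n_neg

def center_per_site(samples):
    """Naive harmonization (assumption): subtract the site's own mean, ignoring labels."""
    mean = sum(x for x, _ in samples) / len(samples)
    return [(x - mean, y) for x, y in samples]

def class_gap(samples):
    """Mean feature of class 1 minus mean feature of class 0."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in samples:
        sums[y] += x
        counts[y] += 1
    return sums[1] / counts[1] - sums[0] / counts[0]

# Two sites with opposite class balance (site and target are dependent).
site_a = make_site(site_offset=5.0, n_pos=9, n_neg=1)
site_b = make_site(site_offset=-3.0, n_pos=1, n_neg=9)

# Pooling raw data: the site offsets inflate the apparent class difference.
gap_pooled_raw = class_gap(site_a + site_b)

# Label-blind centering: the true class difference of 1.0 is shrunk.
gap_naive = class_gap(center_per_site(site_a) + center_per_site(site_b))

print(round(gap_pooled_raw, 2))  # 7.4
print(round(gap_naive, 2))       # 0.36
```

In this toy setting the true class difference is 1.0, yet label-blind centering leaves only 0.36 of it. Preserving the target as a covariate avoids that loss, but it then requires the labels of every sample being harmonized, including test samples, which is where leakage can enter.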
Problem

Research questions and friction points this paper is trying to address.

Evaluating ComBat-based methods for data harmonization with class imbalance
Addressing data leakage issues in harmonization across imbalanced sites
Proposing PrettYharmonize to prevent leakage while preserving biological information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes PrettYharmonize to prevent data leakage
Uses controlled datasets for harmonization benchmarking
Validates with MRI and clinical data comparisons
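
The "pretend the target labels" idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the PrettYharmonize implementation: the location-scale form, the `LocationScaleModel` class, its parameter values, and both function names are hypothetical, and the parameters are assumed to have been estimated on labeled training data only.

```python
from dataclasses import dataclass

@dataclass
class LocationScaleModel:
    """Hypothetical parameters, assumed to be fit on training data only."""
    alpha: float  # overall feature mean
    beta: float   # effect of the target label (the preserved covariate)
    gamma: dict   # additive offset per site
    delta: dict   # multiplicative scale per site

def harmonize(x, site, label, model):
    """Remove the site's offset and scale while keeping the label effect."""
    biological = model.alpha + model.beta * label
    residual = (x - biological - model.gamma[site]) / model.delta[site]
    return biological + residual

def harmonize_pretending(x, site, candidate_labels, model):
    """Harmonize one unlabeled sample once per pretended label.

    The sample's true label is never read; only the candidate labels are used.
    """
    return {label: harmonize(x, site, label, model) for label in candidate_labels}

# Illustrative values only; none of these numbers come from the paper.
model = LocationScaleModel(
    alpha=10.0,
    beta=2.0,
    gamma={"A": 1.0, "B": -1.0},
    delta={"A": 2.0, "B": 0.5},
)

candidates = harmonize_pretending(x=14.0, site="A", candidate_labels=(0, 1), model=model)
print(candidates)  # {0: 11.5, 1: 12.5}
```

Each test sample yields one harmonized value per pretended label. How those candidates are turned into a final prediction is outside the scope of this sketch.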
Nicolás Nieto
Institute of Neuroscience and Medicine (INM-7: Brain and Behaviour), Research Centre Jülich, Jülich, Germany; Institute of Systems Neuroscience, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
Simon B. Eickhoff
Institute of Neuroscience and Medicine (INM-7: Brain and Behaviour), Research Centre Jülich, Jülich, Germany; Institute of Systems Neuroscience, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
Christian Jung
Department Head Security Engineering
Martin Reuter
Artificial Intelligence in Medical Imaging, German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany; Department of Radiology, Harvard Medical School, Boston, MA, USA
K. Diers
Artificial Intelligence in Medical Imaging, German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany
Malte Kelm
Department of Cardiology, Pulmonology and Vascular Medicine, University Hospital and Medical Faculty, Heinrich-Heine University, Duesseldorf, Germany; Cardiovascular Research Institute Düsseldorf (CARID), Medical Faculty, Heinrich-Heine University, Duesseldorf, Germany
Artur Lichtenberg
Department of Cardiac Surgery, University Hospital and Medical Faculty, Heinrich-Heine University, Duesseldorf, Germany
F. Raimondo
Institute of Neuroscience and Medicine (INM-7: Brain and Behaviour), Research Centre Jülich, Jülich, Germany; Institute of Systems Neuroscience, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
K. Patil
Institute of Neuroscience and Medicine (INM-7: Brain and Behaviour), Research Centre Jülich, Jülich, Germany; Institute of Systems Neuroscience, Heinrich Heine University Düsseldorf, Düsseldorf, Germany