Impact of Leakage on Data Harmonization in Machine Learning Pipelines in Class Imbalance Across Sites

📅 2024-10-25

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In multi-site biomedical data analysis, widely used ComBat-based harmonization methods are prone to data leakage, a problem that is most harmful when class balance differs across sites. To address this, we propose PrettYharmonize, a harmonization method that works by "pretending" the target labels instead of accessing the true ones. We first validate the approach on controlled datasets designed to benchmark the utility of harmonization, then compare it against leakage-prone methods on real-world MRI and clinical data. PrettYharmonize achieves comparable performance while avoiding data leakage, particularly in scenarios where site and target are dependent.

📝 Abstract

Machine learning (ML) models benefit from large datasets. Collecting data in biomedical domains is costly and challenging; hence, combining datasets has become common practice. However, datasets obtained under different conditions can exhibit undesired site-specific variability. Data harmonization methods aim to remove site-specific variance while retaining biologically relevant information. This study evaluates the effectiveness of widely used ComBat-based methods for harmonizing data in scenarios where the class balance differs across sites. We find that these methods struggle with data leakage issues. To overcome this problem, we propose a novel approach, PrettYharmonize, designed to harmonize data by pretending the target labels. We validate our approach using controlled datasets designed to benchmark the utility of harmonization. Finally, using real-world MRI and clinical data, we compare leakage-prone methods with PrettYharmonize and show that it achieves comparable performance while avoiding data leakage, particularly in site-target-dependence scenarios.
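
To make the site-target-dependence scenario concrete, here is a minimal sketch on synthetic data. It is not ComBat and not PrettYharmonize: `center_per_site` is a hypothetical, deliberately crude stand-in for harmonization that only subtracts each site's mean, and the class proportions (90% vs. 10%), effect sizes, and sample counts are assumptions chosen for illustration. It shows that when class balance differs across sites, removing site means without label information also removes part of the class signal, which is why label-aware harmonization is tempting and why it raises the leakage question.

```python
import numpy as np


def center_per_site(x, site):
    """Subtract each site's mean from its own samples.

    Hypothetical helper: a crude, label-agnostic stand-in for
    harmonization, used here only to illustrate the problem.
    """
    out = np.asarray(x, dtype=float).copy()
    for s in np.unique(site):
        mask = site == s
        out[mask] -= out[mask].mean()
    return out


def class_gap(values, y):
    """Difference between the mean of class 1 and the mean of class 0."""
    return values[y == 1].mean() - values[y == 0].mean()


rng = np.random.default_rng(0)
n = 1000  # samples per site (assumed)

site = np.repeat([0, 1], n)
# Assumed imbalance: site 0 is about 90% class 1, site 1 about 10% class 1.
y = np.concatenate([rng.random(n) < 0.9, rng.random(n) < 0.1]).astype(int)

# One feature = class effect (2.0) + site offset (1.0) + unit Gaussian noise.
x = 2.0 * y + 1.0 * site + rng.normal(0.0, 1.0, size=2 * n)

x_centered = center_per_site(x, site)

gap_before = class_gap(x, y)
gap_after = class_gap(x_centered, y)

print(f"pooled class gap before centering: {gap_before:.2f}")
print(f"pooled class gap after centering:  {gap_after:.2f}")
```

By construction the class effect within each site is 2.0, yet the pooled class gap after per-site centering falls well below that value, because each site's mean is dominated by its majority class.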

Problem

Research questions and friction points this paper is trying to address.

Evaluating ComBat-based methods for data harmonization with class imbalance

Addressing data leakage issues in harmonization across imbalanced sites

Proposing PrettYharmonize to prevent leakage while preserving biological information
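
As a point of reference for the leakage concern, the sketch below shows one generic way to keep harmonization parameters from seeing held-out data: estimate them on the training split only, then apply them unchanged to the test split. This is not the PrettYharmonize algorithm and not ComBat; `SiteCenterer` is a hypothetical class that shifts each site's mean to the training grand mean, and it never reads target labels. Being label-agnostic, it can still discard class signal when site and target are dependent.

```python
import numpy as np


class SiteCenterer:
    """Hypothetical fit/transform harmonizer (illustration only).

    fit() estimates per-site feature means on training data;
    transform() shifts each site's samples to the training grand mean.
    No target labels are used at any point.
    """

    def fit(self, X, site):
        X = np.asarray(X, dtype=float)
        site = np.asarray(site)
        self.grand_mean_ = X.mean(axis=0)
        self.site_means_ = {
            s: X[site == s].mean(axis=0) for s in np.unique(site)
        }
        return self

    def transform(self, X, site):
        X = np.asarray(X, dtype=float)
        site = np.asarray(site)
        out = X.copy()
        for s in np.unique(site):
            if s not in self.site_means_:
                raise ValueError(f"site {s!r} was not seen during fit")
            # Use parameters learned from the training split only.
            out[site == s] += self.grand_mean_ - self.site_means_[s]
        return out


# Toy data (assumed values): one feature, two sites.
X_train = np.array([[0.0], [2.0], [10.0], [12.0]])
site_train = np.array([0, 0, 1, 1])
X_test = np.array([[1.0], [11.0]])
site_test = np.array([0, 1])

harmonizer = SiteCenterer().fit(X_train, site_train)
X_test_h = harmonizer.transform(X_test, site_test)
print(X_test_h.ravel())
```

In a cross-validation loop, `fit` would be called inside each training fold so that no test-fold statistics, and no test labels, influence the harmonization parameters.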

Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes PrettYharmonize to prevent data leakage

Uses controlled datasets for harmonization benchmarking

Validates with MRI and clinical data comparisons

🔎 Similar Papers

A Survey on Group Fairness in Federated Learning: Challenges, Taxonomy of Solutions and Directions for Future Research