Supervised Contrastive Block Disentanglement

📅 2025-02-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Real-world multi-source data often exhibit spurious correlations and confounding effects due to variations in experimental conditions. To address this, the authors propose SCBD, an algorithm that brings supervised contrastive learning to block-level disentangled representation learning. SCBD constructs two complementary embedding spaces: one invariant to the environment variable $e$ yet discriminative for the target label $y$, and another that explicitly encodes $e$-related confounders. A single hyperparameter $\alpha$ governs the strength of the $e$-invariance, enabling a controllable trade-off between in-distribution (ID) and out-of-distribution (OOD) generalization. On the Camelyon17-WILDS domain generalization benchmark, SCBD achieves significant OOD accuracy improvements. It also corrects inter-well batch effects in a large-scale single-cell imaging dataset of 26 million Optical Pooled Screening images while preserving biologically meaningful signal.

📝 Abstract
Real-world datasets often combine data collected under different experimental conditions. This yields larger datasets, but also introduces spurious correlations that make it difficult to model the phenomena of interest. We address this by learning two embeddings to independently represent the phenomena of interest and the spurious correlations. The embedding representing the phenomena of interest is correlated with the target variable $y$, and is invariant to the environment variable $e$. In contrast, the embedding representing the spurious correlations is correlated with $e$. The invariance to $e$ is difficult to achieve on real-world datasets. Our primary contribution is an algorithm called Supervised Contrastive Block Disentanglement (SCBD) that effectively enforces this invariance. It is based purely on Supervised Contrastive Learning, and applies to real-world data better than existing approaches. We empirically validate SCBD on two challenging problems. The first problem is domain generalization, where we achieve strong performance on a synthetic dataset, as well as on Camelyon17-WILDS. We introduce a single hyperparameter $\alpha$ to control the degree of invariance to $e$. When we increase $\alpha$ to strengthen the degree of invariance, out-of-distribution performance improves at the expense of in-distribution performance. The second problem is batch correction, in which we apply SCBD to preserve biological signal and remove inter-well batch effects when modeling single-cell perturbations from 26 million Optical Pooled Screening images.
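The two-embedding objective described in the abstract can be sketched in plain NumPy. This is a hedged reconstruction, not the authors' implementation: the `supcon_loss` follows the standard Supervised Contrastive Learning formulation (Khosla et al., 2020), but the exact mechanism by which SCBD enforces $e$-invariance is not stated here, so the repulsive SupCon term on $e$ in the $y$-embedding, weighted by $\alpha$, is an assumption, and `scbd_objective` is a hypothetical name.

```python
import numpy as np

def supcon_loss(z, labels, temp=0.1):
    """Supervised contrastive loss on a batch of embeddings (Khosla et al., 2020)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize embeddings
    sim = z @ z.T / temp                               # pairwise cosine similarities
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)            # exclude self-pairs
    # log-softmax of each anchor's similarities over all other samples
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    n_pos = pos.sum(axis=1)
    keep = n_pos > 0                                   # anchors with >= 1 positive
    per_anchor = np.where(pos, log_prob, 0.0).sum(axis=1)[keep] / n_pos[keep]
    return -per_anchor.mean()

def scbd_objective(z_y, z_e, y, e, alpha=1.0):
    """Sketch of an SCBD-style objective: attract same-y pairs in z_y, attract
    same-e pairs in z_e, and (hypothetical mechanism) repel same-e pairs in z_y,
    with alpha controlling the strength of the e-invariance term."""
    return (supcon_loss(z_y, y)              # y-discriminative embedding
            + supcon_loss(z_e, e)            # e-encoding embedding
            - alpha * supcon_loss(z_y, e))   # penalize e-information in z_y (assumption)
```

Under this reading, increasing `alpha` trades ID accuracy for OOD accuracy, matching the trade-off the abstract reports: a larger invariance penalty removes more $e$-information from the $y$-embedding.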
Problem

Research questions and friction points this paper is trying to address.

disentangling spurious correlations from real-world datasets
learning invariant embeddings for phenomena of interest
improving domain generalization and batch correction performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervised Contrastive Block Disentanglement
Two embeddings for phenomena and spurious correlations
Invariance enforced via Supervised Contrastive Learning
Taro Makino
Center for Data Science, New York University; Prescient Design, Genentech
Ji Won Park
Prescient Design, Genentech
Natasa Tagasovska
Prescient Design | Genentech | Roche
machine learning, causality, generative models, copulas
Takamasa Kudo
Research and Early Development (gRED), Genentech
Paula Coelho
Research and Early Development (gRED), Genentech
Jan-Christian Huetter
Biology Research — AI Development (BRAID), Genentech
Heming Yao
Genentech
Burkhard Hoeckendorf
Biology Research — AI Development (BRAID), Genentech
A. C. Leote
Research and Early Development (gRED), Genentech
Stephen Ra
Senior Director, Prescient Design, Genentech
Machine learning, Statistics, Neuroscience
David Richmond
AI and Machine Learning Scientist
computer vision for biomedical images
Kyunghyun Cho
New York University, Genentech
Machine Learning, Deep Learning
Aviv Regev
Research and Early Development (gRED), Genentech
Romain Lopez
Assistant Professor, New York University
Bayesian modelsMachine LearningComputational Biology