Representation Learning for Distributional Perturbation Extrapolation

📅 2025-04-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of predicting RNA-seq distribution shifts induced by unseen genetic or chemical perturbations (e.g., gene knockdowns, drug combinations), an out-of-distribution generalization task. The authors propose the Perturbation Distribution Autoencoder (PDAE), a latent-variable model in which perturbations act as mean shifts in a learned latent space and combine additively. The key theoretical contribution is an identifiability result: given sufficiently diverse training perturbations, the representation and the perturbation effects are identifiable up to affine transformation, which in turn characterizes the class of unseen perturbations for which extrapolation guarantees hold. PDAE is trained by maximizing the distributional similarity between true and predicted perturbation distributions. Empirically, PDAE compares favourably to existing methods and baselines at predicting the effects of unseen perturbations.

📝 Abstract
We consider the problem of modelling the effects of unseen perturbations such as gene knockdowns or drug combinations on low-level measurements such as RNA sequencing data. Specifically, given data collected under some perturbations, we aim to predict the distribution of measurements for new perturbations. To address this challenging extrapolation task, we posit that perturbations act additively in a suitable, unknown embedding space. More precisely, we formulate the generative process underlying the observed data as a latent variable model, in which perturbations amount to mean shifts in latent space and can be combined additively. Unlike previous work, we prove that, given sufficiently diverse training perturbations, the representation and perturbation effects are identifiable up to affine transformation, and use this to characterize the class of unseen perturbations for which we obtain extrapolation guarantees. To estimate the model from data, we propose a new method, the perturbation distribution autoencoder (PDAE), which is trained by maximising the distributional similarity between true and predicted perturbation distributions. The trained model can then be used to predict previously unseen perturbation distributions. Empirical evidence suggests that PDAE compares favourably to existing methods and baselines at predicting the effects of unseen perturbations.
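As a concrete, heavily simplified illustration of the generative model sketched in the abstract (perturbations as additive mean shifts in latent space, fitted by matching distributions), the toy code below uses a linear decoder and the sample energy distance as the distributional similarity. All specifics here, including the dimensions, the linear map `W`, and the choice of energy distance, are assumptions for illustration and not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and a linear decoder standing in for the unknown
# mixing function f; the paper does not prescribe these specifics.
d_latent, d_obs, n = 5, 20, 500
W = rng.normal(size=(d_obs, d_latent))

def sample_perturbed(delta, n=n):
    """Generative model: a perturbation acts as a mean shift `delta` on the
    latent variable z; observations are x = f(z) (here f is linear)."""
    z = rng.normal(size=(n, d_latent)) + delta
    return z @ W.T

def energy_distance(x, y):
    """Sample energy distance: one possible choice of distributional
    similarity for a PDAE-style training objective."""
    def mean_dist(a, b):
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1).mean()
    return 2 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)

# Additivity in latent space: an (unseen) combination of two training
# perturbations is predicted by summing their latent mean shifts. With a
# linear decoder, shifting decoded samples by W @ d2 matches the
# distribution obtained by sampling under the summed shift d1 + d2.
d1, d2 = rng.normal(size=d_latent), rng.normal(size=d_latent)
x_combo = sample_perturbed(d1 + d2)          # "true" combined perturbation
x_pred = sample_perturbed(d1) + (d2 @ W.T)   # prediction via summed shifts
gap = energy_distance(x_combo, x_pred)        # small up to sampling noise
```

In this linear toy the additive latent shift translates exactly into an observation-space shift, so `gap` reflects only sampling noise; with a nonlinear decoder, as in the paper's setting, the combined effect must instead be decoded through the learned model.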
Problem

Research questions and friction points this paper is trying to address.

Predicting the distribution of RNA-seq measurements under unseen gene or drug perturbations
Modelling perturbation effects as additive mean shifts in a latent space
Ensuring identifiability of the representation so that extrapolation guarantees can be obtained
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formulates perturbation effects within a latent variable model
Proposes the perturbation distribution autoencoder (PDAE), trained by matching predicted and observed perturbation distributions
Proves identifiability given sufficiently diverse training perturbations
Julius von Kügelgen
Seminar for Statistics, ETH Zürich
Jakob Ketterer
ETH Zürich
Xinwei Shen
University of Washington
Statistics, Machine Learning
N. Meinshausen
Seminar for Statistics, ETH Zürich
Jonas Peters
Professor of Statistics, ETH Zürich
Causality, Statistics, Machine Learning