🤖 AI Summary
This work addresses a fundamental challenge in single-cell perturbation experiments: because sequencing is destructive, the same cell cannot be observed both before and after a perturbation. Existing methods also struggle to model the multimodal response distributions that arise from latent variables such as microenvironmental context and batch effects. To overcome these limitations, the authors introduce, for the first time, a diffusion generative model operating directly in distribution space. By embedding cellular population distributions into a reproducing kernel Hilbert space (RKHS), they construct a diffusion process that acts on probability measures, explicitly modeling the distributional evolution induced by perturbations. This approach moves beyond the conventional single-response assumption and achieves state-of-the-art performance across multiple single-cell transcriptomic benchmark datasets, substantially improving generalization to unseen perturbations.
📝 Abstract
Building Virtual Cells that can accurately simulate cellular responses to perturbations is a long-standing goal in systems biology. A fundamental challenge is that high-throughput single-cell sequencing is destructive: the same cell cannot be observed both before and after a perturbation. Thus, perturbation prediction requires mapping unpaired control and perturbed populations. Existing models address this by learning maps between distributions, but typically assume a single fixed response distribution when conditioned on observed cellular context (e.g., cell type) and the perturbation type. In reality, responses vary systematically due to unobservable latent factors such as microenvironmental fluctuations and complex batch effects, forming a manifold of possible distributions for the same observed conditions. To account for this variability, we introduce PerturbDiff, which shifts modeling from individual cells to entire distributions. By embedding distributions as points in a Hilbert space, we define a diffusion-based generative process operating directly over probability distributions. This allows PerturbDiff to capture population-level response shifts across hidden factors. Benchmarks on established datasets show that PerturbDiff achieves state-of-the-art performance in single-cell response prediction and generalizes substantially better to unseen perturbations. See our project page (https://katarinayuan.github.io/PerturbDiff-ProjectPage/); code and data will be made publicly available at https://github.com/DeepGraphLearning/PerturbDiff.
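
To make the idea of "embedding distributions as points in a Hilbert space" concrete, the sketch below compares cell populations through their empirical kernel mean embeddings, using the squared maximum mean discrepancy (MMD) as the distance between embedded points. This is an illustration under stated assumptions, not the PerturbDiff implementation: the Gaussian (RBF) kernel, the bandwidth, the synthetic 5-dimensional "expression" data, and the function names `rbf_kernel` and `mmd_squared` are all hypothetical choices made here for the example.

```python
import numpy as np


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of x and the rows of y."""
    # Pairwise squared Euclidean distances via the expansion
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased estimate of the squared MMD between two samples.

    This equals the squared RKHS distance between the empirical kernel
    mean embeddings of x and y, so each population is treated as a
    single point in the Hilbert space.
    """
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy


# Synthetic stand-ins for unpaired cell populations (assumption: 200 cells,
# 5 features each; real data would be expression profiles or latent codes).
rng = np.random.default_rng(0)
control = rng.normal(0.0, 1.0, size=(200, 5))
control_replicate = rng.normal(0.0, 1.0, size=(200, 5))
perturbed = rng.normal(1.0, 1.0, size=(200, 5))

mmd_same = mmd_squared(control, control_replicate, bandwidth=2.0)
mmd_shifted = mmd_squared(control, perturbed, bandwidth=2.0)

print(f"control vs replicate: {mmd_same:.4f}")
print(f"control vs perturbed: {mmd_shifted:.4f}")
```

In this toy setting the control and perturbed populations share no paired cells, yet their embeddings are clearly separated while two draws from the same condition stay close together. A generative process over distributions, as described in the abstract, would operate on such embedded points rather than on individual cells.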