🤖 AI Summary
Gene-level perturbation models assign a single representation to every perturbation of a given gene, so they cannot capture the different transcriptional responses elicited by perturbations at distinct genomic loci. This limits both mechanistic modeling of gene regulation and zero-shot generalization. This work introduces STRAND, a generative model that represents a perturbation by the regulatory DNA sequence at its locus rather than by a gene identifier. The sequence encoding parameterizes a conditional transport process from control to perturbed cell states, using single-cell CRISPR perturbation data, and extends inference-time coverage from about 1.5% to about 95% of the genome. On K562, Jurkat, and RPE1 datasets, STRAND improves discrimination scores by up to 33% in low-sample regimes, achieves the best average rank on unseen-gene perturbation benchmarks, and raises Pearson correlation by up to 0.14 when transferring to novel cell lines. Case studies show that it resolves alternative transcription start sites with different functional effects that gene-level models miss.
📝 Abstract
Predicting how genetic perturbations change cellular state is a core problem for building controllable models of gene regulation. Perturbations targeting the same gene can produce different transcriptional responses depending on their genomic locus, including different transcription start sites and regulatory elements. Gene-level perturbation models collapse these distinct interventions into the same representation. We introduce STRAND, a generative model that predicts single-cell transcriptional responses by conditioning on regulatory DNA sequence. STRAND represents a perturbation by encoding the sequence at its genomic locus and uses this representation to parameterize a conditional transport process from control to perturbed cell states. Representing perturbations by sequence, rather than by a fixed set of gene identifiers, supports zero-shot inference at loci not seen during training and expands inference-time genomic coverage from ~1.5% for gene-level single-cell foundation models to ~95% of the genome. We evaluate STRAND on CRISPR perturbation datasets in K562, Jurkat, and RPE1 cells. STRAND improves discrimination scores by up to 33% in low-sample regimes, achieves the best average rank on unseen gene perturbation benchmarks, and improves transfer to novel cell lines by up to 0.14 in Pearson correlation. Ablations isolate the gains to sequence conditioning and transport, and case studies show that STRAND resolves functionally alternative transcription start sites missed by gene-level models.
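To make the idea of sequence conditioning concrete, the following is a minimal toy sketch in Python with NumPy. It is not the STRAND implementation. The mean-pooled one-hot encoder, the one-hidden-layer velocity field, the Euler integration, the random untrained weights, and the two example sequences are all assumptions chosen for illustration. It shows only the data flow described above: a locus sequence is encoded into a vector, and that vector conditions a transport map applied to control cells, so two loci that share a gene identifier can still yield different predictions.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """One-hot encode a DNA string to shape (len(seq), 4); non-ACGT bases stay all-zero."""
    index = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(seq), 4))
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded


def encode_locus(seq, w_enc):
    """Toy stand-in for a sequence encoder: mean-pool the one-hot matrix, then project."""
    pooled = one_hot(seq).mean(axis=0)      # shape (4,)
    return np.tanh(pooled @ w_enc)          # shape (d_z,)


def velocity(x, t, z, w_in, w_out):
    """Hypothetical conditional velocity field v(x, t, z) as a one-hidden-layer network."""
    n = x.shape[0]
    # Each cell's input is its expression vector, the locus embedding, and the time.
    cond = np.concatenate([x, np.tile(z, (n, 1)), np.full((n, 1), t)], axis=1)
    return np.tanh(cond @ w_in) @ w_out


def transport(control, z, w_in, w_out, n_steps=20):
    """Euler-integrate control cells from t=0 to t=1 under the sequence-conditioned field."""
    x = control.copy()
    dt = 1.0 / n_steps
    for step in range(n_steps):
        x = x + dt * velocity(x, step * dt, z, w_in, w_out)
    return x


# Random, untrained weights: the outputs are meaningless beyond showing the data flow.
rng = np.random.default_rng(0)
n_cells, n_genes, d_z, d_hidden = 5, 8, 4, 16
w_enc = rng.normal(size=(4, d_z))
w_in = rng.normal(size=(n_genes + d_z + 1, d_hidden))
w_out = rng.normal(size=(d_hidden, n_genes)) * 0.1
control = rng.normal(size=(n_cells, n_genes))

# Two made-up loci that a gene-level model would map to the same identifier.
pred_a = transport(control, encode_locus("ACGTACGTGGCC", w_enc), w_in, w_out)
pred_b = transport(control, encode_locus("TTTTAAAATTTA", w_enc), w_in, w_out)
print(pred_a.shape, np.abs(pred_a - pred_b).max())
```

Because the conditioning input is a sequence rather than an index into a fixed gene vocabulary, this kind of design can in principle be queried at any locus with a known sequence, which is the property the abstract credits for zero-shot inference. Note that mean pooling discards base order, so a practical encoder would need to be a learned sequence model.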