🤖 AI Summary
Single-cell RNA sequencing data are inherently high-dimensional, sparse, and unordered, which poses challenges for conventional autoregressive generative models: they impose a sequential ordering bias and accumulate errors. To address this, this work proposes scDiVa, a masked discrete diffusion foundation model that aligns generation with the dropout-induced missingness in single-cell data through a continuous-time forward masking mechanism, jointly modeling discrete gene identities and continuous expression values. scDiVa combines a bidirectional denoising architecture, entropy-normalized serialization, and a latent anchor token to improve expression reconstruction while preserving global cell identity. Pretrained on 59 million cells, scDiVa achieves strong transfer performance across downstream tasks, including batch integration, cell type annotation, and perturbation response prediction.
📝 Abstract
Single-cell RNA-seq profiles are high-dimensional, sparse, and unordered, causing autoregressive generation to impose an artificial ordering bias and suffer from error accumulation. To address this, we propose scDiVa, a masked discrete diffusion foundation model that aligns generation with the dropout-like corruption process by defining a continuous-time forward masking mechanism in token space. scDiVa features a bidirectional denoiser that jointly models discrete gene identities and continuous values, utilizing entropy-normalized serialization and a latent anchor token to maximize information efficiency and preserve global cell identity. The model is trained via depth-invariant time sampling and a dual denoising objective to simulate varying sparsity levels while ensuring precise recovery of both identity and magnitude. Pre-trained on 59 million cells, scDiVa achieves strong transfer performance across major benchmarks, including batch integration, cell type annotation, and perturbation response prediction. These results suggest that masked discrete diffusion serves as a biologically coherent and effective alternative to autoregression.
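To make the forward masking idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linear schedule in which, at time t in [0, 1], each (gene, value) token is independently replaced by a mask with probability t, so larger t mimics heavier dropout. The name `forward_mask`, the reserved `MASK_ID`, the zero placeholder for masked values, and the uniform draw of t are all illustrative assumptions; the abstract does not specify the schedule or how depth-invariant time sampling chooses t.

```python
import numpy as np

MASK_ID = 0  # hypothetical id reserved for the mask token (assumption)


def forward_mask(gene_ids, values, t, rng):
    """Corrupt a tokenized cell at time t under an assumed linear schedule.

    Each token is masked independently with probability t. Masked positions
    lose both the gene identity and the expression value.
    """
    gene_ids = np.asarray(gene_ids)
    values = np.asarray(values, dtype=float)
    masked = rng.random(gene_ids.shape) < t          # Bernoulli(t) per token
    noisy_ids = np.where(masked, MASK_ID, gene_ids)  # hide gene identity
    noisy_values = np.where(masked, 0.0, values)     # hide expression value
    return noisy_ids, noisy_values, masked


# Toy cell: six expressed genes with made-up ids and expression values.
rng = np.random.default_rng(0)
gene_ids = np.array([5, 17, 42, 8, 99, 23])
values = np.array([1.2, 0.4, 3.1, 0.9, 2.2, 0.7])

t = rng.uniform()  # plain uniform time sampling, used here only for illustration
noisy_ids, noisy_values, masked = forward_mask(gene_ids, values, t, rng)
```

A bidirectional denoiser would then see `noisy_ids` and `noisy_values` in full and be trained to recover the original identity and value at the positions where `masked` is true, which is one way to read the dual denoising objective described above.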