🤖 AI Summary
Discrete sequence generation—e.g., DNA, proteins, and language—requires flexible modeling across distinct geometric domains: discrete spaces, Euclidean (Gaussian) spaces, and simplices. Existing diffusion frameworks are domain-specific and lack interoperability.
Method: We unify these three paradigms by recognizing them as diffusion approximations of the Wright–Fisher population genetic model under different large-population limits. Based on this theoretical insight, we propose the first cross-domain unified diffusion framework. It yields a numerically stable simplex diffusion algorithm and enables both domain-agnostic inference—i.e., runtime switching between diffusion domains—and multi-domain joint training within a single model.
Contribution/Results: Wright-Fisher simplicial diffusion is more stable than prior simplex diffusion models and outperforms them on conditional DNA generation. Moreover, a single model trained on multiple domains at once is competitive with models trained on any individual domain, so one model can cover discrete, Euclidean, and probability-simplex diffusion.
📝 Abstract
To model discrete sequences such as DNA, proteins, and language using diffusion, practitioners must choose among three major methods: diffusion in discrete space, Gaussian diffusion in Euclidean space, or diffusion on the simplex. Despite their shared goal, these models have disparate algorithms, theoretical structures, and tradeoffs: discrete diffusion has the most natural domain, Gaussian diffusion has more mature algorithms, and diffusion on the simplex in principle combines the strengths of the other two but in practice suffers from a numerically unstable stochastic process. Ideally, we could view each of these models as an instance of the same underlying framework and let practitioners switch between models for downstream applications. However, previous theories have only considered connections in special cases. Here we build a theory unifying all three methods of discrete diffusion as different parameterizations of the same underlying process: the Wright-Fisher population genetics model. In particular, we find simplicial and Gaussian diffusion as two large-population limits. Our theory formally connects the likelihoods and hyperparameters of these models and leverages decades of mathematical genetics literature to unlock stable simplicial diffusion. Finally, we relieve the practitioner of balancing model trade-offs by demonstrating that it is possible to train a single model that can perform diffusion in any of these three domains at test time. Our experiments show that Wright-Fisher simplicial diffusion is more stable and outperforms previous simplicial diffusion models on conditional DNA generation. We also show that we can train models on multiple domains at once that are competitive with models trained on any individual domain.
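For readers unfamiliar with the Wright-Fisher model named in the abstract, the following is a minimal sketch of the textbook version with symmetric mutation. It is not the paper's diffusion algorithm or its noising schedule. The function names (`wright_fisher_step`, `simulate`) and all parameter values are illustrative assumptions. The sketch only shows how the population size changes the kind of state the process lives in: with a population of one, the state is a single one-hot token, while with a large population the allele frequencies trace a path on the probability simplex.

```python
import numpy as np


def wright_fisher_step(counts, mutation_rate, rng):
    """One generation of a neutral Wright-Fisher model with symmetric mutation.

    counts: integer array of length K with the number of individuals per allele.
    mutation_rate: probability that an offspring's allele is redrawn uniformly
        at random (an illustrative mutation scheme, assumed for this sketch).
    """
    n_pop = int(counts.sum())
    n_alleles = counts.shape[0]
    freqs = counts / n_pop
    # Offspring copy a random parent, then mutate to a uniform allele
    # with probability mutation_rate.
    probs = (1.0 - mutation_rate) * freqs + mutation_rate / n_alleles
    # The next generation is a multinomial resample of the whole population.
    return rng.multinomial(n_pop, probs)


def simulate(n_pop, n_alleles, start_allele, mutation_rate, n_generations, seed=0):
    """Return an array of shape (n_generations + 1, n_alleles) of frequencies."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_alleles, dtype=np.int64)
    counts[start_allele] = n_pop  # start with every individual at one allele
    trajectory = [counts / n_pop]
    for _ in range(n_generations):
        counts = wright_fisher_step(counts, mutation_rate, rng)
        trajectory.append(counts / n_pop)
    return np.stack(trajectory)


if __name__ == "__main__":
    # Population of one: every state is a one-hot vector (a single token).
    single = simulate(n_pop=1, n_alleles=4, start_allele=2,
                      mutation_rate=0.05, n_generations=50)
    # Large population: states are frequency vectors on the simplex.
    large = simulate(n_pop=10_000, n_alleles=4, start_allele=2,
                     mutation_rate=0.05, n_generations=50)
    print(single[-1])
    print(large[-1])
```

In this toy setting the four alleles can be read as the DNA letters A, C, G, T, and each row of the returned array is a noised version of the starting letter. How the population size, mutation rate, and time are rescaled to obtain the simplicial and Gaussian limits is the subject of the paper's theory and is not reproduced here.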