🤖 AI Summary
Generating discrete variables poses significant challenges in natural language processing and biomolecular sequence design, including difficulty in modeling discrete distributions and inefficient sampling. This paper introduces the Shortlisting Model (SLM), which maps discrete sequences into a continuous simplex space and combines diffusion modeling with progressive candidate pruning, dynamically shrinking the candidate set during denoising. A flexible classifier-free guidance mechanism is incorporated to enhance both unconditional generation quality and controllability. Experiments demonstrate that SLM achieves competitive performance across diverse tasks, including DNA promoter and enhancer design, protein sequence design, and character-level and large-vocabulary language modeling, while exhibiting high efficiency and scalability. By unifying discrete structure modeling with geometrically informed diffusion and adaptive candidate pruning, SLM offers a promising approach to generative modeling of discrete sequences.
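The core shortlisting idea, keeping only the strongest candidates per position and renormalizing on the probability simplex as denoising proceeds, can be illustrated with a minimal sketch. The function name, the top-k pruning rule, and the shrinking schedule below are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def shortlist_step(probs: np.ndarray, k: int) -> np.ndarray:
    """Illustrative pruning step: keep the top-k candidates per position
    and renormalize each row back onto the probability simplex.

    probs: array of shape (seq_len, vocab); each row sums to 1.
    """
    # Indices of the k largest probabilities in each row.
    top_idx = np.argpartition(probs, -k, axis=-1)[:, -k:]
    pruned = np.zeros_like(probs)
    rows = np.arange(probs.shape[0])[:, None]
    pruned[rows, top_idx] = probs[rows, top_idx]
    # Renormalize so each row sums to 1 again.
    return pruned / pruned.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(8), size=4)   # 4 positions, vocabulary of 8
for k in (6, 4, 2, 1):                  # progressively shrink the shortlist
    p = shortlist_step(p, k)
# After k reaches 1, each row is one-hot: a discrete sequence is selected.
```

In a real model the pruned distribution would come from a learned denoiser at each step; here the progressive shrinkage alone shows how the search space collapses to a single discrete choice per position.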
📝 Abstract
Generative modeling of discrete variables is challenging yet crucial for applications in natural language processing and biological sequence design. We introduce the Shortlisting Model (SLM), a novel simplex-based diffusion model inspired by progressive candidate pruning. SLM operates on simplex centroids, reducing generation complexity and enhancing scalability. Additionally, SLM incorporates a flexible implementation of classifier-free guidance, enhancing unconditional generation performance. Extensive experiments on DNA promoter and enhancer design, protein design, and character-level and large-vocabulary language modeling demonstrate the competitive performance and strong potential of SLM. Our code can be found at https://github.com/GenSI-THUAIR/SLM
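The classifier-free guidance mentioned above generally follows the standard recipe of extrapolating from an unconditional prediction toward a conditional one. This minimal sketch shows the common logit-space form; it is a generic illustration with made-up numbers, not the paper's specific implementation:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cfg_logits(cond: np.ndarray, uncond: np.ndarray, w: float) -> np.ndarray:
    # Standard classifier-free guidance: move from the unconditional
    # prediction toward the conditional one, scaled by guidance weight w.
    return uncond + w * (cond - uncond)

# Toy logits over a 3-token vocabulary (illustrative values only).
cond = np.array([2.0, 0.5, -1.0])
uncond = np.array([1.0, 1.0, 0.0])
guided = softmax(cfg_logits(cond, uncond, w=2.0))
```

With w = 1 this reduces to the conditional prediction; w > 1 sharpens the distribution toward the conditioning signal, which is the usual quality/controllability trade-off that guidance exposes.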