ShortListing Model: A Streamlined SimplexDiffusion for Discrete Variable Generation

📅 2025-08-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Discrete variable generation poses significant challenges in natural language processing and biomolecular sequence design, including difficulties in modeling discrete distributions and inefficient sampling. This paper introduces the Simplex-Centered Shortlist Model (SLM), which maps discrete sequences into a continuous simplex space and integrates discrete diffusion modeling with progressive candidate set pruning to dynamically shrink the search space during denoising. A novel classifier-free guidance mechanism is incorporated to enhance both unconditional generation quality and controllability. Experiments demonstrate that SLM achieves state-of-the-art performance on diverse tasks—including DNA regulatory sequence generation, protein sequence design, and multi-granularity language modeling—while exhibiting high efficiency, scalability, and cross-domain generalization. By unifying discrete structure modeling with geometrically informed diffusion and adaptive search, SLM establishes a new paradigm for generative modeling of discrete sequences.

Technology Category

Application Category

📝 Abstract
Generative modeling of discrete variables is challenging yet crucial for applications in natural language processing and biological sequence design. We introduce the Shortlisting Model (SLM), a novel simplex-based diffusion model inspired by progressive candidate pruning. SLM operates on simplex centroids, reducing generation complexity and enhancing scalability. Additionally, SLM incorporates a flexible implementation of classifier-free guidance, enhancing unconditional generation performance. Extensive experiments on DNA promoter and enhancer design, protein design, character-level and large-vocabulary language modeling demonstrate the competitive performance and strong potential of SLM. Our code can be found at https://github.com/GenSI-THUAIR/SLM
Problem

Research questions and friction points this paper is trying to address.

Develops simplex diffusion model for discrete variable generation
Addresses complexity in natural language and biological sequences
Enhances scalability and performance in unconditional generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simplex-based diffusion model for discrete variables
Classifier-free guidance for unconditional generation
Centroid-based approach reducing complexity and enhancing scalability
Yuxuan Song
Yuxuan Song
Tsinghua University
Deep Generative ModelsLLM4Science,
Z
Zhe Zhang
Generative Symbolic Intelligence Lab (GenSI), Tsinghua University
Y
Yu Pei
Generative Symbolic Intelligence Lab (GenSI), Tsinghua University
Jingjing Gong
Jingjing Gong
SII
Machine LearningAI for ScienceLarge Language ModelEmbodied AI
Qiying Yu
Qiying Yu
Tsinghua University
Multimodal LearningSelf-supervised LearningLarge Models
Z
Zheng Zhang
ByteDance Seed
M
Mingxuan Wang
ByteDance Seed
H
Hao Zhou
Institute for AI Industry Research (AIR), Tsinghua University
J
Jingjing Liu
Generative Symbolic Intelligence Lab (GenSI), Tsinghua University
Wei-Ying Ma
Wei-Ying Ma
Tsinghua University
Generative AI and Large Language Models (LLMs) for Science