🤖 AI Summary
Scarce property annotations and the limited success of in-context learning hinder molecular design. This paper introduces DemoDiff, the first diffusion model for molecular generation with in-context learning, enabling task-aware generation from only a few molecule–property demonstrations. Methodologically, the authors propose a fragment-based Node Pair Encoding tokenizer that drastically compresses graph size, and combine a graph diffusion Transformer with a demonstration-conditioned denoising mechanism to support efficient large-scale pretraining. Evaluated on 33 property-guided design tasks, DemoDiff achieves an average rank of 3.63, outperforming or matching large language models with 100-1000× more parameters and clearly surpassing existing domain-specific methods. Key contributions: (i) the first in-context learning paradigm for molecular generation; (ii) a lightweight, efficient molecular graph representation; and (iii) a scalable, task-oriented pretraining framework for molecular design.
📝 Abstract
In-context learning allows large models to adapt to new tasks from a few demonstrations, but it has shown limited success in molecular design. Existing databases such as ChEMBL contain molecular properties spanning millions of biological assays, yet labeled data for each property remain scarce. To address this limitation, we introduce demonstration-conditioned diffusion models (DemoDiff), which define task contexts using a small set of molecule-score examples instead of text descriptions. These demonstrations guide a denoising Transformer to generate molecules aligned with target properties. For scalable pretraining, we develop a new molecular tokenizer with Node Pair Encoding that represents molecules at the motif level, requiring 5.5× fewer nodes. We curate a dataset containing millions of context tasks from multiple sources covering both drugs and materials, and pretrain a 0.7-billion-parameter model on it. Across 33 design tasks in six categories, DemoDiff matches or surpasses language models 100-1000× larger and achieves an average rank of 3.63 compared to 5.25-10.20 for domain-specific approaches. These results position DemoDiff as a molecular foundation model for in-context molecular design. Our code is available at https://github.com/liugangcode/DemoDiff.
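To give intuition for how a pair-merging tokenizer can shrink a molecule's representation, here is a minimal sketch in the spirit of byte-pair encoding. Note the assumptions: the paper's Node Pair Encoding operates on molecular graphs at the motif level, while this toy version merges adjacent tokens in linear sequences; the function names (`learn_motif_vocab`, `merge_pair`) are illustrative, not from the DemoDiff codebase.

```python
from collections import Counter

def most_frequent_pair(seqs):
    """Count adjacent token pairs across all sequences; return the most common."""
    counts = Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])  # fuse the two tokens into one motif
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def learn_motif_vocab(corpus, num_merges):
    """Iteratively merge the most frequent pair, BPE-style, to build motif tokens."""
    vocab = []
    for _ in range(num_merges):
        pair = most_frequent_pair(corpus)
        if pair is None:
            break
        vocab.append(pair)
        corpus = [merge_pair(seq, pair) for seq in corpus]
    return vocab, corpus

# Toy "molecules" as atom-token sequences (e.g., propanol-like chains).
corpus = [["C", "C", "O"], ["C", "C", "C", "O"]]
vocab, compressed = learn_motif_vocab(corpus, 1)
```

After one merge the frequent pair ("C", "C") becomes a single "CC" motif token, so each sequence needs fewer tokens; applied to graph fragments at scale, this is the kind of compression that yields the reported 5.5× reduction in node count.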