Chem3DLLM: 3D Multimodal Large Language Models for Chemistry

📅 2025-08-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing autoregressive language models struggle to generate 3D molecular conformations because of three limitations: (1) continuous 3D geometric structures are incompatible with discrete token spaces; (2) heterogeneous modalities such as proteins, ligands, and text are difficult to unify within a single architecture; and (3) the models lack physicochemical priors, which makes structural constraints hard to enforce. To address these challenges, the paper introduces Chem3DLLM, a unified protein-conditioned multimodal large language model. The method combines a reversible text encoding of 3D molecular structures, which uses run-length compression to shrink the representation about 3x without losing structural information, with a lightweight protein embedding projector and reinforcement learning driven by stability-based rewards. Conditioned on protein pocket features, the model generates ligand conformations and reports a state-of-the-art Vina score of -7.21 on structure-based drug design, which the authors present as support for the unified multimodal approach in practical drug discovery.

📝 Abstract

In the real world, a molecule is a 3D geometric structure. Compared to 1D SMILES sequences and 2D molecular graphs, 3D molecules represent the most informative molecular modality. Despite the rapid progress of autoregressive-based language models, they cannot handle the generation of 3D molecular conformation due to several challenges: 1) 3D molecular structures are incompatible with LLMs' discrete token space, 2) integrating heterogeneous inputs like proteins, ligands, and text remains difficult within a unified model, and 3) LLMs lack essential scientific priors, hindering the enforcement of physical and chemical constraints during generation. To tackle these issues, we present Chem3DLLM, a unified protein-conditioned multimodal large language model. Our approach designs a novel reversible text encoding for 3D molecular structures using run-length compression, achieving 3x size reduction while preserving complete structural information. This enables seamless integration of molecular geometry with protein pocket features in a single LLM architecture. We employ reinforcement learning with stability-based rewards to optimize chemical validity and incorporate a lightweight protein embedding projector for end-to-end training. Experimental results on structure-based drug design demonstrate state-of-the-art performance with a Vina score of -7.21, validating our unified multimodal approach for practical drug discovery applications.
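The abstract does not spell out the encoding itself, so the following is only a minimal sketch of what a reversible run-length compression of structure text could look like. The escape character `~`, the `MIN_RUN` threshold, the token layout `~<count>~<char>`, and the sample atom line are all assumptions made for illustration, not details taken from Chem3DLLM.

```python
ESC = "~"      # hypothetical escape character marking a compressed run
MIN_RUN = 5    # only compress runs long enough to actually save characters


def rle_encode(text: str) -> str:
    """Replace long runs of one character with the token ~<count>~<char>."""
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        j = i
        while j < n and text[j] == ch:
            j += 1
        run = j - i
        # A literal escape character is always written as a run token,
        # so the decoder never mistakes it for the start of a token.
        if ch == ESC or run >= MIN_RUN:
            out.append(f"{ESC}{run}{ESC}{ch}")
        else:
            out.append(ch * run)
        i = j
    return "".join(out)


def rle_decode(encoded: str) -> str:
    """Invert rle_encode exactly."""
    out = []
    i, n = 0, len(encoded)
    while i < n:
        ch = encoded[i]
        if ch == ESC:
            end = encoded.index(ESC, i + 1)   # closing escape after the count
            count = int(encoded[i + 1:end])
            out.append(encoded[end + 1] * count)
            i = end + 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Illustrative fixed-width atom line with padding, loosely in the style of
# structure file formats (made up for this example).
sample = "       0.00000       1.20000       0.00000 C"
encoded = rle_encode(sample)
restored = rle_decode(encoded)
```

Because decoding restores the input exactly, a language model can emit the shorter string while the original coordinates remain recoverable. The compression ratio depends entirely on how repetitive the input text is; the 3x figure in the abstract refers to the paper's own scheme, not to this sketch.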

Problem

Research questions and friction points this paper is trying to address.

Generating 3D molecular structures, which are incompatible with LLMs' discrete token space

Integrating heterogeneous inputs like proteins, ligands, and text

Enforcing physical and chemical constraints during molecular generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reversible text encoding for 3D molecules

Reinforcement learning with stability rewards

Lightweight protein embedding projector
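The page does not define the stability reward. As a rough illustration of the idea, the sketch below scores a generated molecule by the fraction of atoms whose total bond order matches a typical valence, a notion of stability often used when evaluating generated 3D molecules. The valence table, the function names, and the assumption of neutral atoms with explicit hydrogens are all simplifications introduced here and may differ from the paper's actual reward.

```python
# Typical valences for neutral atoms with explicit hydrogens (a simplification).
EXPECTED_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1}


def atom_valences(n_atoms, bonds):
    """Sum bond orders per atom; bonds are (atom_i, atom_j, order) triples."""
    totals = [0] * n_atoms
    for a, b, order in bonds:
        totals[a] += order
        totals[b] += order
    return totals


def stability_reward(elements, bonds):
    """Return the fraction of atoms whose valence matches the expected value."""
    if not elements:
        return 0.0
    totals = atom_valences(len(elements), bonds)
    stable = sum(
        1 for el, v in zip(elements, totals) if EXPECTED_VALENCE.get(el) == v
    )
    return stable / len(elements)


# Methane: every atom satisfies its valence.
methane = stability_reward(
    ["C", "H", "H", "H", "H"],
    [(0, 1, 1), (0, 2, 1), (0, 3, 1), (0, 4, 1)],
)

# A carbon with only three hydrogens: the carbon is under-bonded.
broken = stability_reward(
    ["C", "H", "H", "H"],
    [(0, 1, 1), (0, 2, 1), (0, 3, 1)],
)
```

In a reinforcement learning setup, a scalar like this could be computed on each sampled molecule and used as the reward signal, so that chemically implausible outputs are penalized during training.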

🔎 Similar Papers

3D-MolT5: Towards Unified 3D Molecule-Text Modeling with 3D Molecular Tokenization