🤖 AI Summary
Existing autoregressive language models struggle to generate 3D molecular conformations because of three limitations: (1) continuous 3D geometric structures are incompatible with discrete token spaces; (2) heterogeneous modalities such as proteins, ligands, and text are difficult to unify within a single architecture; and (3) the models lack physicochemical priors, which makes structural constraints hard to enforce during generation. To address these challenges, we introduce Chem3DLLM, a unified protein-conditioned multimodal large language model. The method combines a reversible text encoding of 3D molecular structures, which uses run-length compression to reduce size by 3× without losing structural information, with a lightweight protein embedding projector and reinforcement learning driven by stability-based rewards. The model generates ligand conformations conditioned on protein pocket features and achieves a state-of-the-art Vina score of −7.21 on structure-based drug design tasks, which supports the unified multimodal approach for practical drug discovery.
📝 Abstract
In the real world, a molecule is a 3D geometric structure. Compared to 1D SMILES sequences and 2D molecular graphs, 3D structures are the most informative molecular modality. Despite the rapid progress of autoregressive language models, they cannot yet generate 3D molecular conformations because of several challenges: 1) 3D molecular structures are incompatible with LLMs' discrete token space, 2) integrating heterogeneous inputs such as proteins, ligands, and text within a unified model remains difficult, and 3) LLMs lack essential scientific priors, which hinders the enforcement of physical and chemical constraints during generation. To tackle these issues, we present Chem3DLLM, a unified protein-conditioned multimodal large language model. We design a novel reversible text encoding for 3D molecular structures that uses run-length compression, achieving a 3x size reduction while preserving complete structural information. This enables seamless integration of molecular geometry with protein pocket features in a single LLM architecture. We employ reinforcement learning with stability-based rewards to optimize chemical validity and incorporate a lightweight protein embedding projector for end-to-end training. Experimental results on structure-based drug design demonstrate state-of-the-art performance with a Vina score of -7.21, validating our unified multimodal approach for practical drug discovery applications.
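The abstract does not spell out the encoding itself, so the following is only a minimal sketch of how run-length compression can be kept reversible on a text serialization of a structure. It is not the Chem3DLLM encoder. The escape character `~`, the `MIN_RUN` threshold, and the function names `rle_encode` and `rle_decode` are assumptions made for illustration, and the sample string is made up rather than taken from a real molecule file.

```python
import re

MARK = "~"    # hypothetical escape character; must not occur in the input text
MIN_RUN = 5   # shorter runs are copied verbatim, since "~5~x" already costs 4 characters

# A marker has the form ~<count>~<char>; DOTALL lets <char> be a newline.
_RUN = re.compile(re.escape(MARK) + r"(\d+)" + re.escape(MARK) + r"(.)", re.DOTALL)


def rle_encode(text: str) -> str:
    """Replace each run of >= MIN_RUN identical characters with ~<count>~<char>."""
    if MARK in text:
        raise ValueError(f"input must not contain the escape character {MARK!r}")
    out = []
    i = 0
    while i < len(text):
        # Find the end of the run that starts at position i.
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        run = j - i
        if run >= MIN_RUN:
            out.append(f"{MARK}{run}{MARK}{text[i]}")
        else:
            out.append(text[i] * run)
        i = j
    return "".join(out)


def rle_decode(encoded: str) -> str:
    """Invert rle_encode by expanding every ~<count>~<char> marker."""
    return _RUN.sub(lambda m: m.group(2) * int(m.group(1)), encoded)


# Made-up sample with long runs of spaces, zeros, and newlines.
sample = "C" + " " * 10 + "0" * 7 + "\n" * 6 + "N"
encoded = rle_encode(sample)
assert rle_decode(encoded) == sample   # the round trip is lossless
assert len(encoded) < len(sample)      # and the encoded text is shorter
```

Reserving one character as an escape and delimiting the count on both sides keeps decoding unambiguous even when the repeated character is itself a digit, which matters for coordinate text. The compression ratio of such a scheme depends entirely on how repetitive the serialized text is, so this sketch makes no claim about reproducing the 3x figure reported above.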