3D-MolT5: Towards Unified 3D Molecule-Text Modeling with 3D Molecular Tokenization

📅 2024-06-09
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Existing molecular-language multimodal models largely neglect 3D structural information and suffer from weak cross-modal interaction and poor modality alignment during pretraining. To address these limitations, the authors propose a generative foundation model that unifies 1D molecular sequences, 3D geometric structures, and natural language. The method introduces a fine-grained substructure-to-3D-token mapping grounded in 3D molecular fingerprints, enabling end-to-end joint modeling of all three modalities within a shared token space. It integrates SELFIES-based 1D encoding, 3D geometry fingerprinting, a learnable 3D token vocabulary, unified 1D/3D self-supervised pretraining, and multi-task instruction tuning. Evaluated on molecular property prediction, molecule captioning, and text-based molecule generation, the model achieves state-of-the-art performance across all benchmarks, significantly enhancing cross-modal understanding and generation and establishing a unified paradigm for molecular-language representation learning.

📝 Abstract
The integration of molecule and language has garnered increasing attention in molecular science. Recent advancements in Language Models (LMs) have demonstrated potential for the comprehensive modeling of molecule and language. However, existing works exhibit notable limitations. Most overlook the modeling of 3D information, which is crucial for understanding molecular structures and functions. While some attempts have been made to leverage external structure encoding modules to inject 3D molecular information into LMs, obvious difficulties hinder the integration of molecular structure and language text, such as modality alignment and separate tuning. To bridge this gap, we propose 3D-MolT5, a unified framework designed to model both 1D molecular sequence and 3D molecular structure. The key innovation lies in our methodology for mapping fine-grained 3D substructure representations (based on 3D molecular fingerprints) to a specialized 3D token vocabulary for 3D-MolT5. This 3D structure token vocabulary enables the seamless combination of 1D sequence and 3D structure representations in a tokenized format, allowing 3D-MolT5 to encode molecular sequence (SELFIES), molecular structure, and text sequences within a unified architecture. In addition, we introduce 1D and 3D joint pre-training to enhance the model's comprehension of these diverse modalities in a joint representation space and to better generalize to various tasks. Through instruction tuning on multiple downstream datasets, 3D-MolT5 outperforms existing methods on molecular property prediction, molecule captioning, and text-based molecule generation tasks. Our code will be available on GitHub soon.
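The core tokenization idea in the abstract, discretizing fingerprint-based 3D substructure representations into a dedicated 3D token vocabulary, can be sketched as a toy example. Everything below (the vocabulary size, the hash-and-fold mapping, the `<3d_N>` token format) is an illustrative assumption, not the paper's actual learned mapping:

```python
import hashlib

VOCAB_SIZE = 512  # assumed size of the 3D token vocabulary (illustrative)

def fingerprint_to_3d_token(substructure_bits):
    """Map a binary 3D-fingerprint segment for one substructure to a
    discrete token by hashing it and folding the digest into the
    vocabulary. A toy stand-in for the paper's fingerprint-to-token
    mapping; the real mapping is learned, not hashed."""
    digest = hashlib.sha256(bytes(substructure_bits)).digest()
    token_id = int.from_bytes(digest[:4], "big") % VOCAB_SIZE
    return f"<3d_{token_id}>"

# Two fake per-substructure fingerprint segments (placeholder bits)
bits_a = [1, 0, 0, 1, 1, 0, 1, 0]
bits_b = [0, 1, 1, 0, 0, 1, 0, 1]
tokens = [fingerprint_to_3d_token(b) for b in (bits_a, bits_b)]
```

Because the mapping is deterministic, the same 3D substructure always yields the same token, which is what lets 3D structure share a token space with SELFIES and text.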
Problem

Research questions and friction points this paper is trying to address.

How to integrate molecular and natural-language representations for comprehensive joint modeling.
Existing approaches largely neglect 3D structural information, which is crucial for molecular understanding.
Cross-modal interaction and modality alignment in molecule-text pretraining remain weak.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework encoding SELFIES sequences, 3D structure, and text in one architecture
Mapping fine-grained 3D substructure fingerprints to a specialized 3D token vocabulary
Joint 1D/3D pre-training followed by multi-task instruction tuning
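The innovations above can be tied together in a rough sketch of how one tokenized input might combine SELFIES tokens, 3D tokens, and text in a single shared-vocabulary sequence. The marker names and interleaving scheme here are assumptions for illustration; the paper's actual input format may differ:

```python
def build_unified_sequence(selfies_tokens, threed_tokens, text_tokens):
    """Concatenate 1D SELFIES tokens, discretized 3D structure tokens,
    and text tokens into one token list, delimited by special markers.
    Marker names are illustrative, not from the paper."""
    seq = ["<mol_1d>"] + selfies_tokens
    seq += ["<mol_3d>"] + threed_tokens
    seq += ["<text>"] + text_tokens
    return seq

seq = build_unified_sequence(
    ["[C]", "[O]"],           # SELFIES tokens for a small fragment
    ["<3d_17>", "<3d_402>"],  # assumed discretized 3D substructure tokens
    ["caption", "tokens"],    # natural-language tokens
)
# seq now mixes all three modalities in a single token list
```

Because all three modalities live in one vocabulary, a single encoder-decoder (T5-style) model can attend across them without an external structure encoder, which is the alignment advantage the paper emphasizes.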