Navigating Chemical-Linguistic Sharing Space with Heterogeneous Molecular Encoding

📅 2024-12-30
🤖 AI Summary
Chemical language models face two key challenges: (1) the semantic misalignment between molecular representations and natural language, and (2) the scarcity of high-quality molecule–text paired data. To address these, we propose the Heterogeneous Molecular Encoding (HME) framework, a unified encoder that compresses molecular features from fragment sequences, topological structures, and 3D conformations via Q-learning. We further construct MCMoD, a dataset of over one million molecules annotated with multiple conditions (properties, fragments, and descriptions). With HME, chemical language models can use text to guide molecular design and use molecules to guide text generation. Experiments show a 37.8% improvement in Fréchet ChemNet Distance (FCD) for multi-constraint molecular design, including zero-shot scenarios, and an 11.6% improvement in BLEU for molecule description generation. HME also generalizes to unseen task combinations.

📝 Abstract
Chemical language models (CLMs) are prominent for their effectiveness in exploring chemical space and enabling molecular engineering. However, when exploring chemical-linguistic space, CLMs suffer from the gap between natural language and molecular representations. This gap stems primarily from inherent modeling differences between molecules and text: molecular representations are modeled in a unified manner to learn chemical space, whereas natural language models semantic space sequentially. The limited availability of high-quality text-to-molecule datasets further exacerbates this challenge. To address the problem, we first verified the information bias in molecular representations from different perspectives. We then developed the Heterogeneous Molecular Encoding (HME) framework, a unified molecular encoder that compresses molecular features from fragment sequence, topology, and conformation with Q-learning. To better model chemical-linguistic space, we further constructed the MCMoD dataset, which contains over one million molecules with various conditions, including properties, fragments, and descriptions. Experimentally, HME enables CLMs to explore the chemical-linguistic sharing space in two directions: (1) chemical space exploration with linguistic guidance, where HME achieves significant improvements (+37.8% FCD) for molecular design under multiple constraints, even in zero-shot scenarios; (2) linguistic space exploration with molecular guidance, where HME generates high-quality textual descriptions (+11.6% BLEU) for molecules. These results highlight the precision of HME in handling multi-objective and cross-domain tasks, as well as its generalization to unseen task combinations. HME offers a new perspective on navigating chemical-linguistic sharing space, advancing the potential of CLMs in both fundamental research and practical applications in chemistry.
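For readers unfamiliar with the BLEU figure quoted above, the following is a minimal sketch of single-reference, sentence-level BLEU (clipped n-gram precision combined with a brevity penalty). It is an illustration only: the function names `ngrams` and `sentence_bleu`, the whitespace tokenization, the uniform weights over 1- to 4-grams, and the lack of smoothing are all assumptions made here, not the evaluation setup used by the HME authors.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, candidate, max_n=4):
    """Single-reference BLEU with uniform weights and no smoothing (assumed setup)."""
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # unsmoothed BLEU is zero if any n-gram precision is zero
        log_precisions.append(math.log(matched / total))
    # The brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)
```

A candidate identical to the reference scores 1.0, and a candidate that shares no words with it scores 0.0. A truncated but otherwise exact description is penalized only through the brevity term. Published results typically use corpus-level BLEU from an established toolkit, so scores from this sketch are not directly comparable to the reported numbers.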
Problem

Research questions and friction points this paper is trying to address.

Chemical Language Model
Molecular Representation
High-Quality Molecule–Text Paired Data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Heterogeneous Molecular Encoding (HME)
Integration of Molecular Encoding with Chemical Language Models (CLMs)
MCMoD Dataset for Enhanced Molecular Design
Liuzhenghao Lv
PhD student, Computer Science, Peking University
Large Language Models, AI for Science, Spiking Neural Networks
Hao Li
School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, 518055, China; Peng Cheng Laboratory, Shenzhen, 518000, China; AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
Yu Wang
School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, 518055, China; AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
Zhiyuan Yan
School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
Zi-Xuan Chen
School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
Zongying Lin
School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
Li Yuan
Research Associate, University of Science & Technology of China (USTC)
Antibiotic resistance, Wastewater treatment, Environmental bioremediation, Anaerobic digestion, Fate of organic pollutants
Yonghong Tian
School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, 518055, China; Peng Cheng Laboratory, Shenzhen, 518000, China; AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, Shenzhen, 518055, China