A quantitative analysis of knowledge-learning preferences in large language models in molecular science

📅 2024-02-06
📈 Citations: 3
Influential: 0
🤖 AI Summary
This study investigates the modality preferences and data adaptability of large language models (LLMs) in molecular science knowledge acquisition. To address this, we introduce ChEBI-20-MM, a multimodal benchmark for molecular science evaluated across 1,263 experiments, and propose a modal transition probability matrix to quantify cross-modal compatibility. We further design an interpretable statistical method based on localized feature filtering to uncover context-specific knowledge-mapping mechanisms. By jointly evaluating SMILES, IUPAC nomenclature, and textual modalities, the framework identifies the most suitable modality combinations per task and enables quantitative attribution of LLMs' knowledge-acquisition preferences in molecular science. The results offer empirical guidelines for model-data co-optimization, supporting more trustworthy and interpretable LLM deployment in chemistry.
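The modal transition probability matrix is described only at a high level here. The sketch below shows one plausible way such a matrix could be assembled from per-task evaluation scores; the modality names, the `transition_matrix` helper, the record layout, and the winner-take-all aggregation rule are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch only: modality names, record layout, and the
# winner-take-all aggregation are assumptions, not the paper's recipe.
import numpy as np

MODALITIES = ["SMILES", "IUPAC", "caption"]

def transition_matrix(results):
    """results: iterable of (task_id, source, target, score) experiment records.
    Returns a row-stochastic matrix P where P[i, j] estimates how often
    target modality j scores best when translating from source modality i."""
    idx = {m: k for k, m in enumerate(MODALITIES)}
    counts = np.zeros((len(MODALITIES), len(MODALITIES)))
    # Group experiment records by (task, source modality).
    groups = {}
    for task, src, tgt, score in results:
        groups.setdefault((task, src), []).append((tgt, score))
    # For each group, credit the target modality with the best score.
    for (_, src), pairs in groups.items():
        best_tgt = max(pairs, key=lambda p: p[1])[0]
        counts[idx[src], idx[best_tgt]] += 1
    # Normalize each row to a probability distribution (zero rows stay zero).
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)

# Toy usage with made-up scores for two tasks:
demo = [("t1", "SMILES", "caption", 0.61), ("t1", "SMILES", "IUPAC", 0.54),
        ("t2", "SMILES", "IUPAC", 0.72), ("t1", "IUPAC", "caption", 0.70)]
print(transition_matrix(demo))
```

A row of this matrix can then be read as an empirical answer to "given a source modality, which target modality does the model handle best?", which is the kind of cross-modal compatibility question the benchmark is built to quantify.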

📝 Abstract
Deep learning has significantly advanced molecular modeling and design, enabling efficient understanding and discovery of novel molecules. In particular, large language models (LLMs) introduce a fresh research paradigm to tackle scientific problems from a natural language processing (NLP) perspective. LLMs significantly enhance our understanding and generation of molecules, often surpassing existing methods with their capabilities to decode and synthesize complex molecular patterns. However, two key issues remain: how to quantify the match between model and data modalities and how to identify the knowledge-learning preferences of models. To address these challenges, we propose a multi-modal benchmark, named ChEBI-20-MM, and perform 1,263 experiments to assess models' compatibility with data modalities and knowledge acquisition. Through a modal transition probability matrix, we provide insights into the most suitable modalities for tasks. Furthermore, we introduce a statistically interpretable approach to discover context-specific knowledge mapping by localized feature filtering. Our analysis offers an exploration of the learning mechanism and paves the way for advancing LLMs in molecular science.
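The abstract does not spell out the "localized feature filtering" step. One plausible reading, sketched below under stated assumptions, is a filter that keeps token-level features whose scores stand out against a local window rather than a single global threshold; the `local_filter` name, the window statistic, and the z-score rule are all hypothetical choices for illustration.

```python
# Minimal sketch of a localized feature filter: keep positions whose score
# exceeds the local mean by z_thresh local standard deviations. This is an
# illustration of the general idea, not the paper's actual algorithm.
import numpy as np

def local_filter(scores, window=5, z_thresh=1.0):
    """scores: per-token feature scores. Returns indices of locally salient tokens."""
    scores = np.asarray(scores, dtype=float)
    kept = []
    for i in range(len(scores)):
        lo, hi = max(0, i - window), min(len(scores), i + window + 1)
        local = scores[lo:hi]          # local context window around position i
        mu, sigma = local.mean(), local.std()
        if sigma > 0 and (scores[i] - mu) / sigma > z_thresh:
            kept.append(i)
    return kept

# Toy usage: attribution scores for a token sequence; only the peaks survive.
attr = [0.10, 0.12, 0.90, 0.11, 0.10, 0.08, 0.85, 0.09]
print(local_filter(attr, window=2))  # -> [2, 6]
```

Filtering against a local baseline makes the selection context-specific: a moderately high score inside a flat region is kept, while the same score inside a generally high-scoring region is not, which matches the abstract's goal of mapping knowledge per context rather than globally.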
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Molecular Science
Knowledge-Learning Preference
Innovation

Methods, ideas, or system contributions that make the work stand out.

ChEBI-20-MM
Large Language Models
Molecular Science
Pengfei Liu
School of Computer Science and Engineering, Sun Yat-sen University
Jun Tao
School of Computer Science and Engineering, Sun Yat-sen University
Scientific visualization, user interface and interaction, visual analytics, software visualization
Zhixiang Ren
Peng Cheng Laboratory