🤖 AI Summary
This study addresses the challenge of efficiently adapting general-purpose large language models (LLMs) to molecular science. We propose a unified multi-task curriculum learning framework that post-trains general reasoning models on a curated, high-quality molecular instruction dataset, jointly optimizing molecular structure understanding, property prediction, and generation. Notably, we are the first to successfully extend this paradigm to end-to-end retrosynthetic planning, achieving performance on RetroBench comparable to that of domain-specific methods. Our model establishes new state-of-the-art results across three major benchmarks (LlaSMol, TOMG-Bench, and MuMOInstruct), demonstrating strong cross-task knowledge fusion and generalization. Key contributions include: (1) a large-scale, high-quality molecular instruction dataset built by curating and unifying existing public resources; (2) empirical validation of multi-task curriculum learning for adapting reasoning LLMs to chemistry; and (3) extension of the resulting model to complex molecular reasoning tasks, most notably retrosynthesis.
📝 Abstract
Molecules play a crucial role in biomedical research and discovery, particularly in small-molecule drug development. Given the rapid advancement of large language models, especially the recent emergence of reasoning models, it is natural to ask how a general-purpose language model can be efficiently adapted for molecular science applications. In this work, we introduce BioMedGPT-Mol, a molecular language model designed to support molecular understanding and generation tasks. By curating and unifying existing public instruction datasets, we assemble a large-scale, comprehensive, and high-quality training dataset, and we fine-tune the model through a carefully designed multi-task learning framework. On a consolidated benchmark derived from LlaSMol, TOMG-Bench, and MuMOInstruct, BioMedGPT-Mol achieves state-of-the-art performance. Our experimental results demonstrate that a general-purpose reasoning model can be effectively and efficiently post-trained into a professional molecular language model through a well-structured multi-task curriculum. Building on this capability, we further explore the retrosynthetic planning task, and the resulting performance on RetroBench demonstrates that the model is competitive as an end-to-end retrosynthetic planner. We anticipate that our approach can be extended to other biomedical scientific domains.