🤖 AI Summary
This work systematically evaluates large language models (LLMs) on text-driven open molecular generation—encompassing molecular editing, optimization, and customized design. To this end, we introduce TOMG-Bench, the first dedicated benchmark comprising nine subtasks (5,000 samples each) and an automated, RDKit-based multidimensional evaluation framework assessing validity, uniqueness, novelty, and property fidelity. We further propose OpenMolIns, a chemistry-specific instruction-tuning dataset designed to enhance LLMs’ semantic understanding of SMILES and SELFIES representations and improve structural generation accuracy. On TOMG-Bench, Llama3.1-8B fine-tuned with OpenMolIns outperforms all open-source general-purpose LLMs and achieves a 46.5% improvement over GPT-3.5-turbo. All code and datasets are publicly released.
📝 Abstract
In this paper, we propose the Text-based Open Molecule Generation Benchmark (TOMG-Bench), the first benchmark to evaluate the open-domain molecule generation capability of LLMs. TOMG-Bench encompasses three major tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom). Each major task contains three subtasks, each comprising 5,000 test samples. Given the inherent complexity of evaluating open molecule generation, we also develop an automated evaluation system that measures both the quality and the accuracy of the generated molecules. Our comprehensive benchmarking of 25 LLMs reveals the current limitations of, as well as potential areas for improvement in, text-guided molecule discovery. Furthermore, we propose OpenMolIns, a specialized instruction-tuning dataset constructed to address the challenges raised by TOMG-Bench. Fine-tuned on OpenMolIns, Llama3.1-8B outperforms all open-source general-purpose LLMs, surpassing even GPT-3.5-turbo by 46.5% on TOMG-Bench. Our code and datasets are available at https://github.com/phenixace/TOMG-Bench.