🤖 AI Summary
The scarcity of high-quality annotated data hinders molecular–language cross-modal modeling in biomedicine.
Method: We propose LA³, an automatic annotation augmentation framework driven by large language models (LLMs). Its LLM-prompted rewriting paradigm generates molecular descriptions that remain semantically consistent with the originals while being syntactically diverse and lexically rich, and it applies to standard molecular string representations (e.g., SMILES, InChI). Using LA³, we construct LaChEBI-20, the first large-scale dataset of augmented molecular annotations, and train LaMolT5, a T5-based molecular language model for molecule generation, molecule captioning, and transfer to image, text, and graph tasks. A minimal sketch of the rewriting step appears after this summary.
Results: LaMolT5 achieves state-of-the-art performance on text-based *de novo* molecule generation and molecule captioning, with relative improvements of up to 301% over its benchmark architecture. Further experiments on image, text, and graph tasks confirm that the augmentation strategy generalizes across modalities.
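
The core augmentation idea is to have an LLM rewrite each molecule annotation so that chemical facts are preserved while phrasing varies. The sketch below illustrates this with an OpenAI-style chat API; the prompt wording, model name, sampling settings, and the `rewrite_annotation` helper are illustrative assumptions, not the paper's actual LA³ prompt or code.

```python
# Minimal sketch of LLM-prompted annotation rewriting, assuming the `openai`
# Python client (>= 1.0). Prompt text and model name are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REWRITE_PROMPT = (
    "Rewrite the following molecule description. Preserve every chemical fact "
    "(structure, functional groups, roles, derivations), but vary the sentence "
    "structure and vocabulary:\n\n{annotation}"
)

def rewrite_annotation(annotation: str, n_variants: int = 3) -> list[str]:
    """Generate semantically consistent, lexically varied rewrites of one annotation."""
    variants = []
    for _ in range(n_variants):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user",
                       "content": REWRITE_PROMPT.format(annotation=annotation)}],
            temperature=0.9,      # higher temperature encourages lexical diversity
        )
        variants.append(response.choices[0].message.content.strip())
    return variants

# Example: augmenting a single annotation from a ChEBI-20-style entry
original = ("The molecule is a monocarboxylic acid that is acetic acid in which "
            "one of the methyl hydrogens is replaced by a phenyl group.")
augmented = rewrite_annotation(original)
```

Each original (molecule, annotation) pair thus yields several augmented pairs, which is what allows the downstream model to see the same molecule described in multiple ways.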
📝 Abstract
Recent advancements in AI for biological research focus on integrating molecular data with natural language to accelerate drug discovery. However, the scarcity of high-quality annotations limits progress in this area. This paper introduces LA$^3$, a Language-based Automatic Annotation Augmentation framework that leverages large language models to augment existing datasets and thereby improve AI training. We demonstrate the effectiveness of LA$^3$ by creating an enhanced dataset, LaChEBI-20, in which we systematically rewrite the annotations of molecules from an established dataset. The rewritten annotations preserve essential molecular information while providing more varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5 based on a benchmark architecture to learn the mapping between molecular representations and augmented annotations. Experimental results on text-based *de novo* molecule generation and molecule captioning demonstrate that LaMolT5 outperforms state-of-the-art models. Notably, incorporating LA$^3$ leads to improvements of up to 301% over the benchmark architecture. Furthermore, we validate the effectiveness of LA$^3$ on notable applications in *image*, *text*, and *graph* tasks, affirming its versatility and utility.
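
To make the mapping step concrete, the following sketch shows how a T5-style model can be fine-tuned on (SMILES, rewritten annotation) pairs, assuming the Hugging Face `transformers` library. The `t5-base` checkpoint, the `lachebi20_pairs` list, and the task prefix are stand-ins for illustration; this is not the paper's training code.

```python
# Minimal sketch: fine-tune a T5-style molecule captioner on augmented pairs.
# Checkpoint, data, and prefix are assumptions made for this example.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Each pair maps a molecular string representation to one augmented annotation.
lachebi20_pairs = [
    ("C1=CC=C(C=C1)CC(=O)O",
     "A monocarboxylic acid obtained by replacing one methyl hydrogen of "
     "acetic acid with a phenyl ring."),
]

model.train()
for smiles, caption in lachebi20_pairs:
    inputs = tokenizer("caption the molecule: " + smiles,
                       return_tensors="pt", truncation=True)
    labels = tokenizer(caption, return_tensors="pt", truncation=True).input_ids
    loss = model(**inputs, labels=labels).loss   # standard seq2seq cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The reverse direction (text-based *de novo* molecule generation) follows the same seq2seq recipe with the annotation as input and the molecular string as target.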