ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data

📅 2025-06-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Automated extraction of chemical synthesis steps is hindered by the ambiguity of chemical language in the scientific literature and by the scarcity of high-quality annotated data. To address this, we propose ChemActor, a fully fine-tuned large language model that converts between unstructured experimental procedures and structured, machine-executable action sequences. Methodologically, ChemActor combines data selection based on distribution divergence, a multi-round LLM circle review mechanism, and a two-stage task learning paradigm (reaction → description → action), together with full-parameter fine-tuning and explicit modeling of machine-executable actions. On the reaction-to-description (R2D) and description-to-action (D2A) tasks, two core tasks for understanding synthesis procedures, ChemActor achieves state-of-the-art performance and outperforms the baseline model by 10%. These results position ChemActor as a foundation model for automating organic synthesis workflows.

📝 Abstract

With the increasing interest in robotic synthesis in the context of organic chemistry, the automated extraction of chemical procedures from literature is critical. However, this task remains challenging due to the inherent ambiguity of chemical language and the high cost of human annotation required for developing reliable computer-aided extraction protocols. Here, we present ChemActor, a fully fine-tuned large language model (LLM), as a chemical executor to convert between unstructured experimental procedures and structured action sequences. We propose a sequential LLM-generated data framework to address the challenges of insufficient and low-quality annotated data. This framework integrates a data selection module that selects data based on distribution divergence, with a general-purpose LLM, to generate machine-executable actions from a single molecule input. Additionally, we introduce a novel multi-round LLMs circle review metric, which reflects the model's advanced understanding of chemical experimental procedures. Extensive experiments on reaction-to-description (R2D) and description-to-action (D2A) tasks demonstrate that ChemActor, augmented by LLM-generated data, achieves state-of-the-art performance, outperforming the baseline model by 10%. The code is available at: https://github.com/Zhanghahah/ChemActor.
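The abstract does not spell out how the data selection module measures distribution divergence, so the following is only a minimal sketch of the general idea under stated assumptions. It uses smoothed unigram token distributions and KL divergence, and it keeps the candidates closest to a reference set; the function names, the whitespace tokenizer, and the "keep the lowest divergence" rule are all illustrative choices, not the authors' implementation.

```python
import math
from collections import Counter


def token_distribution(texts, vocab, smoothing=1.0):
    """Smoothed unigram distribution of `texts` over a fixed vocabulary.

    Whitespace tokenization and add-one smoothing are assumptions made
    for this sketch only.
    """
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts[w] for w in vocab) + smoothing * len(vocab)
    return {w: (counts[w] + smoothing) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def select_by_divergence(candidates, reference, k):
    """Return the k candidates whose token distribution is closest to the
    reference corpus (hypothetical selection rule)."""
    vocab = sorted(
        {tok for text in candidates + reference for tok in text.lower().split()}
    )
    ref_dist = token_distribution(reference, vocab)
    scored = [
        (kl_divergence(token_distribution([cand], vocab), ref_dist), cand)
        for cand in candidates
    ]
    scored.sort(key=lambda pair: pair[0])
    return [cand for _, cand in scored[:k]]


reference = ["stir the mixture at room temperature", "add water and stir"]
candidates = ["stir the mixture and add water", "the weather was sunny today"]
selected = select_by_divergence(candidates, reference, k=1)
```

In this toy example the procedure-like sentence is selected because its token distribution diverges less from the reference procedures than the unrelated sentence does. A real pipeline would likely compare richer representations than unigram counts.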

Problem

Research questions and friction points this paper is trying to address.

Automated extraction of chemical procedures from literature

Addressing ambiguity in chemical language and annotation costs

Generating machine-executable actions from molecule inputs
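To make "machine-executable actions" more concrete, here is a small sketch of how a structured action sequence could be represented and serialized. The `Action` class, the action names (`ADD`, `STIR`, `FILTER`), and the parameter keys are hypothetical and chosen for illustration; the page does not specify ChemActor's actual action schema.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """One synthesis step: an action name plus optional named parameters.

    The schema is an assumption for this sketch, not the paper's format.
    """
    name: str
    params: dict = field(default_factory=dict)

    def render(self):
        # Actions without parameters are rendered as a bare name.
        if not self.params:
            return self.name
        args = ", ".join(f"{key}={value}" for key, value in self.params.items())
        return f"{self.name}({args})"


def render_sequence(actions):
    """Serialize an ordered list of actions into a single line of text."""
    return "; ".join(action.render() for action in actions)


# Hypothetical structured form of: "Water (10 mL) was added, the mixture
# was stirred for 2 h at 25 C, and the solid was filtered off."
procedure = [
    Action("ADD", {"material": "water", "amount": "10 mL"}),
    Action("STIR", {"duration": "2 h", "temperature": "25 C"}),
    Action("FILTER"),
]
sequence_text = render_sequence(procedure)
```

A description-to-action model would map the free-text sentence in the comment to a sequence like `procedure`; keeping the target as typed records rather than raw strings makes it easier to validate before handing it to a robotic platform.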

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully fine-tuned LLM for chemical action conversion

Sequential LLM-generated data framework for annotation

Multi-round LLMs circle review metric
