ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automated extraction of chemical synthesis steps is hindered by the ambiguity of chemical language in the literature and by the scarcity of high-quality annotated data. To address this, the authors propose ChemActor, a fully fine-tuned large language model that converts between unstructured experimental procedures and structured, machine-executable action sequences. ChemActor is trained with a sequential LLM-generated data framework (reaction → description → action) that combines a data selection module based on distribution divergence with a general-purpose LLM, and its understanding of experimental procedures is assessed with a multi-round LLM circle review metric. On the reaction-to-description (R2D) and description-to-action (D2A) tasks, ChemActor augmented with LLM-generated data achieves state-of-the-art performance and outperforms the baseline model by 10%.

📝 Abstract
With the increasing interest in robotic synthesis in the context of organic chemistry, the automated extraction of chemical procedures from literature is critical. However, this task remains challenging due to the inherent ambiguity of chemical language and the high cost of human annotation required for developing reliable computer-aided extraction protocols. Here, we present ChemActor, a fully fine-tuned large language model (LLM), as a chemical executor to convert between unstructured experimental procedures and structured action sequences. We propose a sequential LLM-generated data framework to address the challenges of insufficient and low-quality annotated data. This framework integrates a data selection module that selects data based on distribution divergence, with a general-purpose LLM, to generate machine-executable actions from a single molecule input. Additionally, we introduce a novel multi-round LLMs circle review metric, which reflects the model's advanced understanding of chemical experimental procedures. Extensive experiments on reaction-to-description (R2D) and description-to-action (D2A) tasks demonstrate that ChemActor, augmented by LLM-generated data, achieves state-of-the-art performance, outperforming the baseline model by 10%. The code is available at: https://github.com/Zhanghahah/ChemActor.
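
The abstract says the data selection module picks data based on distribution divergence but does not say which divergence or which selection rule is used. As a minimal sketch under stated assumptions, the example below scores each candidate action sequence by the KL divergence between its action-type distribution and a reference distribution, then keeps the closest ones. The function names, the additive smoothing, the action labels, and the choice to keep low-divergence samples are all hypothetical and are not taken from the paper or its code.

```python
import math
from collections import Counter


def action_distribution(actions, vocabulary, smoothing=1e-6):
    """Smoothed relative frequency of each action type (assumed representation)."""
    counts = Counter(actions)
    total = len(actions) + smoothing * len(vocabulary)
    return {a: (counts[a] + smoothing) / total for a in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same keys."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p)


def select_closest(candidates, reference_actions, k):
    """Keep the k candidates whose action distribution is closest to the reference.

    Assumption: low divergence is preferred. The source does not state
    whether low- or high-divergence samples are kept.
    """
    vocabulary = sorted(
        set(reference_actions) | {a for seq in candidates for a in seq}
    )
    reference = action_distribution(reference_actions, vocabulary)
    ranked = sorted(
        candidates,
        key=lambda seq: kl_divergence(
            action_distribution(seq, vocabulary), reference
        ),
    )
    return ranked[:k]


# Hypothetical action labels, for illustration only.
reference = ["ADD", "ADD", "STIR", "FILTER"]
candidates = [
    ["ADD", "STIR"],
    ["WASH", "WASH"],
    ["ADD", "ADD", "STIR", "FILTER"],
]
selected = select_closest(candidates, reference, k=1)
```

Smoothing keeps the divergence finite when a candidate contains an action type the reference never uses; a symmetric measure such as Jensen-Shannon divergence would be an equally reasonable choice here.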

Problem

Research questions and friction points this paper is trying to address.

Automated extraction of chemical procedures from literature
Addressing ambiguity in chemical language and annotation costs
Generating machine-executable actions from molecule inputs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully fine-tuned LLM for chemical action conversion
Sequential LLM-generated data framework for annotation
Multi-round LLMs circle review metric
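
To make "structured action sequence" concrete, here is a minimal sketch of how the output of a description-to-action step might be represented and serialized. The `Action` class, the action names, and the parameter keys are hypothetical illustrations and are not the schema used by ChemActor.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """One synthesis step: an action name plus optional parameters (assumed schema)."""
    name: str
    params: dict = field(default_factory=dict)

    def render(self):
        # Actions without parameters are rendered as a bare name.
        if not self.params:
            return self.name
        args = ", ".join(f"{key}={value}" for key, value in self.params.items())
        return f"{self.name}({args})"


def render_sequence(actions):
    """Serialize an ordered list of actions into a single line."""
    return "; ".join(action.render() for action in actions)


# Hypothetical procedure: "Ethanol (10 mL) was added, the mixture was
# stirred for 2 h, and the solid was filtered off."
procedure = [
    Action("ADD", {"material": "ethanol", "amount": "10 mL"}),
    Action("STIR", {"duration": "2 h"}),
    Action("FILTER"),
]
print(render_sequence(procedure))
# prints: ADD(material=ethanol, amount=10 mL); STIR(duration=2 h); FILTER
```

A flat, ordered list of named actions with explicit parameters is easy to compare against a reference annotation and to hand to downstream automation, which is why this shape is used for the sketch.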

Authors

Yu Zhang
AI Institute, Shanghai Jiao Tong University, China
Ruijie Yu
AI Institute, Shanghai Jiao Tong University, China
Jidong Tian
AI Institute, Shanghai Jiao Tong University, China
Feng Zhu
Frontiers Science Center for Transformative Molecules, Shanghai Jiao Tong University, China
Jiapeng Liu
X-Imaging Intelligent Technology (Shanghai) Co. LTD., China
Xiaokang Yang
AI Institute, Shanghai Jiao Tong University, China
Yaohui Jin
Shanghai Jiao Tong University
Yanyan Xu
AI Institute, Shanghai Jiao Tong University, China