An Autonomous Large Language Model Agent for Chemical Literature Data Mining

📅 2024-02-20

🏛️ arXiv.org

📈 Citations: 5

✨ Influential: 0

🤖 AI Summary

High-throughput experimental data acquisition remains costly and narrow in scope. The chemical literature is rich in reaction information, but heterogeneous writing styles, complex coreference, and multimodal representations impede reliable structured extraction. To address this, we propose the first end-to-end large language model (LLM)-based autonomous agent framework tailored for chemistry, integrating domain-specific knowledge constraints, dynamic prompt generation, iterative reasoning, and self-refinement mechanisms. The framework fully automates high-fidelity extraction of critical reaction conditions, including catalysts, solvents, temperature, and time, from unstructured text. On reaction condition extraction, it achieves expert-level performance (>92% accuracy, recall, and F1-score), reduces inference latency by over 80%, and substantially outperforms existing baseline methods.
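To make the extract-then-refine idea concrete, here is a minimal sketch in Python. It is not the paper's implementation: the `ReactionConditions` record, the `extract_with_refinement` loop, and the `fake_model` stand-in for an LLM call are all hypothetical names introduced for illustration. The sketch assumes the model returns a dictionary keyed by field name, and that refinement simply means re-prompting for fields that are still empty.

```python
from dataclasses import dataclass, fields
from typing import Callable, Dict, List, Optional


@dataclass
class ReactionConditions:
    """Hypothetical record for the four fields named in the summary."""
    catalyst: Optional[str] = None
    solvent: Optional[str] = None
    temperature: Optional[str] = None
    time: Optional[str] = None

    def missing(self) -> List[str]:
        """Names of fields that have not been filled yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


def extract_with_refinement(
    text: str,
    ask_model: Callable[[str], Dict[str, str]],
    max_rounds: int = 3,
) -> ReactionConditions:
    """Query the model, then re-prompt only for fields that are still missing."""
    record = ReactionConditions()
    for _ in range(max_rounds):
        todo = record.missing()
        if not todo:
            break
        # Assumption: a prompt that lists only the unresolved fields.
        prompt = f"Extract {', '.join(todo)} from this passage:\n{text}"
        answer = ask_model(prompt)
        for name in todo:
            value = answer.get(name)
            if value:
                setattr(record, name, value)
    return record


def fake_model(prompt: str) -> Dict[str, str]:
    """Stand-in for an LLM call; returns hard-coded values for demonstration."""
    requested = prompt.splitlines()[0]
    known = {"catalyst": "Pd(PPh3)4", "solvent": "THF", "temperature": "80 C"}
    return {k: v for k, v in known.items() if k in requested}
```

Calling `extract_with_refinement("Pd(PPh3)4 in THF at 80 C", fake_model)` fills three fields and leaves `time` empty, because the stand-in model never supplies it. In a real pipeline, a field that stays unresolved after the last round would be a natural candidate for human review.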

📝 Abstract

Chemical synthesis, which is crucial for advancing material synthesis and drug discovery, impacts various sectors including environmental science and healthcare. The rise of technology in chemistry has generated extensive chemical data, challenging researchers to discern patterns and refine synthesis processes. Artificial intelligence (AI) helps by analyzing data to optimize synthesis and increase yields. However, AI faces challenges in processing literature data due to the unstructured format and diverse writing styles of chemical literature. To overcome these difficulties, we introduce an end-to-end AI agent framework capable of high-fidelity extraction from extensive chemical literature. This AI agent employs large language models (LLMs) for prompt generation and iterative optimization. It functions as a chemistry assistant, automating data collection and analysis, thereby saving manpower and enhancing performance. We evaluate the framework's efficacy using the accuracy, recall, and F1 score of the extracted reaction condition data, and we compare our method with human experts in terms of content correctness and time efficiency. The proposed approach marks a significant advancement in automating chemical literature extraction and demonstrates the potential for AI to revolutionize data management and utilization in chemistry.
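The abstract names accuracy, recall, and F1 score as the evaluation metrics but does not spell out the scoring protocol here. The sketch below shows one plausible field-level scheme in Python, under stated assumptions: a prediction counts as correct only on an exact string match with the gold value, a wrong or spurious value counts as a false positive, and a gold value that is missed or mispredicted counts as a false negative. The function name `score` and the record layout are hypothetical.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[str]]


def score(predicted: Record, gold: Record) -> Dict[str, float]:
    """Field-level accuracy, precision, recall, and F1 for one extracted record."""
    tp = fp = fn = correct = 0
    for name, truth in gold.items():
        guess = predicted.get(name)
        if guess == truth:
            correct += 1          # exact match, including "both empty"
            if truth is not None:
                tp += 1
        else:
            if guess is not None:
                fp += 1           # a value was extracted, but it is wrong
            if truth is not None:
                fn += 1           # a gold value was missed or mispredicted
    accuracy = correct / len(gold) if gold else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


gold = {"catalyst": "Pd(PPh3)4", "solvent": "THF", "temperature": "80 C", "time": "12 h"}
predicted = {"catalyst": "Pd(PPh3)4", "solvent": "THF", "temperature": "80 C", "time": None}
metrics = score(predicted, gold)
```

With three of four fields recovered and nothing spurious, this gives accuracy 0.75, precision 1.0, recall 0.75, and F1 = 2(1.0)(0.75)/1.75, roughly 0.857. Exact matching is a deliberately strict choice; a real evaluation would likely normalize units and chemical name synonyms before comparing.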

Problem

Research questions and friction points this paper is trying to address.

Extracting chemical reaction data from diverse literature formats

Overcoming varied writing styles and complex coreference relationships

Integrating multimodal information for comprehensive data mining

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-powered multi-agent system for data extraction

Specialized agents handle text and multimodal data

Open-sourced benchmark for chemical literature mining
