Thermodynamic Prediction Enabled by Automatic Dataset Building and Machine Learning

📅 2025-07-09
📈 Citations: 0 (influential: 0)
🤖 AI Summary
In thermochemistry and materials science, experimental determination of thermodynamic properties—such as enthalpies of formation—is costly and yields fragmented, heterogeneous data. To address this bottleneck, we propose an end-to-end predictive framework integrating large language models (LLMs) with machine learning. We introduce LMExt, a novel literature-mining tool that leverages LLMs to automatically extract and harmonize thermodynamic information from unstructured, multi-source scientific literature, enabling efficient construction of a high-quality mineral thermodynamic dataset. Subsequently, we apply gradient-boosting algorithms—including CatBoost—to model structure–property relationships and predict key thermodynamic parameters with high accuracy. Experimental evaluation demonstrates substantial improvements in both data curation efficiency and prediction fidelity, reducing experimental screening costs by orders of magnitude. This work establishes a scalable, automated paradigm for thermochemical database construction and inverse materials design.

📝 Abstract
New discoveries in chemistry and materials science demand an ever-growing body of requisite knowledge and a heavier experimental workload, which gives machine learning (ML) a unique opportunity to play a critical role in accelerating research. Here, we demonstrate (1) the use of large language models (LLMs) for automated literature reviews, and (2) the training of an ML model to predict chemical knowledge (thermodynamic parameters). Our LLM-based literature review tool (LMExt) successfully extracted chemical information into a machine-readable structure, including stability constants for metal cation-ligand interactions and thermodynamic properties, and also handled broader data types (medical research papers and financial reports), effectively overcoming the challenges inherent in each domain. Using the automatically acquired thermodynamic data, an ML model was trained with the CatBoost algorithm to accurately predict thermodynamic parameters (e.g., enthalpy of formation) of minerals. This work highlights the transformative potential of integrated ML approaches to reshape chemistry and materials science research.
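To make "machine-readable structure" concrete, the following is a minimal sketch of how LLM-extracted thermodynamic values might be parsed and harmonized to a single unit. It is not LMExt's actual schema or code: the record fields, the JSON keys (`mineral`, `formula`, `property`, `value`, `unit`), and the function names are all assumptions made for illustration, and it presumes the LLM has already been prompted to return a JSON list.

```python
import json
from dataclasses import dataclass
from typing import List

# Thermochemical calorie: 1 kcal = 4.184 kJ (exact by definition).
KCAL_TO_KJ = 4.184


@dataclass
class ThermoRecord:
    """Hypothetical record layout; field names are illustrative only."""
    mineral: str
    formula: str
    property_name: str
    value_kj_per_mol: float
    source: str


def to_kj_per_mol(value: float, unit: str) -> float:
    """Convert a molar energy to kJ/mol; reject units we do not recognise."""
    normalized = unit.strip().lower().replace(" ", "")
    if normalized == "kj/mol":
        return value
    if normalized == "kcal/mol":
        return value * KCAL_TO_KJ
    if normalized == "j/mol":
        return value / 1000.0
    raise ValueError(f"unsupported unit: {unit!r}")


def parse_llm_output(raw_json: str, source: str) -> List[ThermoRecord]:
    """Turn an (assumed) JSON list emitted by the LLM into typed records."""
    records = []
    for item in json.loads(raw_json):
        records.append(
            ThermoRecord(
                mineral=item["mineral"].strip(),
                formula=item["formula"].strip(),
                property_name=item["property"].strip(),
                value_kj_per_mol=to_kj_per_mol(float(item["value"]), item["unit"]),
                source=source,
            )
        )
    return records
```

Calling `parse_llm_output` on a JSON string with one placeholder entry reported in kcal/mol yields a single `ThermoRecord` whose value is expressed in kJ/mol. Rejecting unknown units with an error, rather than passing them through, keeps mixed-unit values from silently entering the training set.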
Problem

Research questions and friction points this paper is trying to address.

Automate literature review using large language models
Predict thermodynamic parameters via machine learning
Overcome domain challenges in chemical data extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs automate literature reviews for data extraction
CatBoost algorithm predicts thermodynamic parameters accurately
LMExt handles diverse data types, including medical research papers and financial reports
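The gradient-boosting idea behind CatBoost can be sketched in a few lines. The code below is a deliberately simplified, standard-library stand-in and not CatBoost itself: it boosts one-split regression stumps on squared error, whereas CatBoost adds ordered boosting, symmetric trees, and native categorical-feature handling. All function names are hypothetical, and any data passed to it here would be synthetic rather than measured thermodynamic values.

```python
from typing import List, Sequence, Tuple

Row = Sequence[float]
# (feature index, threshold, value if row[j] <= threshold, value otherwise)
Stump = Tuple[int, float, float, float]
Model = Tuple[float, float, List[Stump]]  # (base prediction, learning rate, stumps)


def fit_stump(X: List[Row], residuals: List[float]) -> Stump:
    """Pick the single split that minimises squared error on the residuals."""
    best = None
    best_err = float("inf")
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            err = sum((r - lmean) ** 2 for r in left)
            err += sum((r - rmean) ** 2 for r in right)
            if err < best_err:
                best_err = err
                best = (j, t, lmean, rmean)
    if best is None:
        # No feature has two distinct values: fall back to a constant stump.
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best


def predict_stump(stump: Stump, row: Row) -> float:
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value


def fit_boosting(X: List[Row], y: List[float],
                 n_rounds: int = 50, learning_rate: float = 0.1) -> Model:
    """Each round fits a stump to the current residuals (the negative
    gradient of squared error) and adds a shrunken copy to the ensemble."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, row)
                 for p, row in zip(preds, X)]
    return base, learning_rate, stumps


def predict(model: Model, row: Row) -> float:
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)
```

In practice the descriptors in each row would be compositional or structural features of a mineral and the target a property such as enthalpy of formation; which features the authors actually used is not stated in the abstract.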
Juejing Liu
Department of Chemistry, Washington State University, Pullman, Washington 99164, United States
Haydn Anderson
School of Electrical Engineering & Computer Science, Washington State University, Pullman, Washington 99164, United States
Noah I. Waxman
School of Electrical Engineering & Computer Science, Washington State University, Pullman, Washington 99164, United States
Vsevolod Kovalev
School of Electrical Engineering & Computer Science, Washington State University, Pullman, Washington 99164, United States
Byron Fisher
Department of Chemistry, Washington State University, Pullman, Washington 99164, United States
Elizabeth Li
Department of Chemistry, Washington State University, Pullman, Washington 99164, United States
Xiaofeng Guo
PhD student, Robotics Institute, Carnegie Mellon University
robotics, mobile manipulation, tactile sensing, learning and control