DARWIN 1.5: Large Language Models as Materials Science Adapted Learners

📅 2024-12-16
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional approaches to materials property prediction suffer from poor generalizability, heavy reliance on handcrafted features, and an inability to model complex, high-dimensional materials property spaces. Method: This work introduces DARWIN 1.5, the largest open-source language model tailored for materials science, built upon the LLaMA-7B architecture and trained on 6 million scientific publications alongside multimodal experimental data. It employs domain alignment, instruction fine-tuning, and cross-task prompt engineering to enable natural-language-driven, task-agnostic property prediction and inverse design without task-specific descriptors. Contribution/Results: DARWIN 1.5 achieves knowledge transfer across 49,256 materials and 21 experimental datasets, outperforming state-of-the-art methods on all eight evaluated materials design tasks, with up to a 59.1% improvement in prediction accuracy over the base model. These results demonstrate the feasibility of large language models as universal foundation models for intelligent materials discovery.

📝 Abstract
Materials discovery and design aim to find compositions and structures with desirable properties over highly complex and diverse physical spaces. Traditional solutions, such as high-throughput simulations or machine learning, often rely on complex descriptors, which hinder generalizability and transferability across different material systems. Moreover, these descriptors may inadequately represent macro-scale material properties, which are influenced by structural imperfections and compositional variations in real-world samples, thus limiting their practical applicability. To address these challenges, we propose DARWIN 1.5, the largest open-source large language model tailored for materials science. By leveraging natural language as input, DARWIN eliminates the need for task-specific descriptors and enables a flexible, unified approach to material property prediction and discovery. Our approach integrates 6M material domain papers and 21 experimental datasets from 49,256 materials across modalities while enabling cross-task knowledge transfer. The enhanced model achieves up to 59.1% improvement in prediction accuracy over the base LLaMA-7B architecture and outperforms SOTA machine learning approaches across 8 materials design tasks. These results establish LLMs as a promising foundation for developing versatile and scalable models in materials science.
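The abstract's core idea is that plain natural language can replace handcrafted descriptors as the model input. A minimal sketch of what such an instruction-style training sample could look like is below; the record fields and prompt template are illustrative assumptions, not the paper's actual data format.

```python
# Sketch: turning a raw materials record into a natural-language
# instruction pair, instead of computing task-specific descriptors.
# Field names and the prompt wording are hypothetical.

def to_instruction_sample(record: dict) -> dict:
    """Build a prompt/completion pair from a materials measurement."""
    prompt = (
        f"Given the material {record['composition']}, "
        f"predict its {record['property']} in {record['unit']}."
    )
    # The target is simply the measured value rendered as text.
    completion = str(record["value"])
    return {"prompt": prompt, "completion": completion}

# Example with a hypothetical band-gap entry:
sample = to_instruction_sample({
    "composition": "CH3NH3PbI3",
    "property": "band gap",
    "unit": "eV",
    "value": 1.55,
})
print(sample["prompt"])
# Given the material CH3NH3PbI3, predict its band gap in eV.
```

Because every task reduces to text in and text out, the same model can be fine-tuned across all 21 datasets at once, which is what enables the cross-task knowledge transfer the abstract describes.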
Problem

Research questions and friction points this paper is trying to address.

Materials Science, Predictive Modeling, Machine Learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

DARWIN 1.5, Materials Science, Language Model
Tong Xie
Green Dynamics & University of New South Wales
Solar Cells, Large Language Models, Cheminformatics, Nano Materials
Yuwei Wan
GreenDynamics, Sydney, NSW, Australia; Department of Linguistics and Translation, City University of Hong Kong, Hong Kong, China
Yixuan Liu
AMD, Tsinghua University
Generative AI
Yuchen Zeng
Microsoft Research
Machine Learning, Artificial Intelligence, Algorithms
Wenjie Zhang
School of Computer Science and Engineering, University of New South Wales, Kensington, NSW, Australia
Chunyu Kit
City University of Hong Kong
Computational linguistics
Dongzhan Zhou
Researcher at Shanghai AI Lab
AI4Science, computer vision, deep learning
B. Hoex
School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW, Australia