🤖 AI Summary
Conventional approaches to materials property prediction suffer from poor generalizability, heavy reliance on handcrafted descriptors, and an inability to model complex, high-dimensional materials property spaces. Method: This work introduces DARWIN 1.5, the largest open-source large language model tailored for materials science, built upon the LLaMA-7B architecture and trained on 6 million materials-domain publications alongside experimental data across modalities. It employs domain alignment, instruction fine-tuning, and cross-task prompt engineering to enable natural-language-driven, task-agnostic property prediction and inverse design without task-specific descriptors. Contribution/Results: DARWIN 1.5 achieves knowledge transfer across 49,256 materials and 21 experimental datasets, improving prediction accuracy by up to 59.1% over the base LLaMA-7B model and outperforming state-of-the-art machine learning methods on all eight evaluated materials design tasks. These results demonstrate the feasibility of large language models as universal foundation models for intelligent materials discovery.
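To make the instruction fine-tuning step concrete, below is a minimal sketch of what a single training record for a property prediction task could look like. It assumes an Alpaca-style instruction/input/output schema, which is common for LLaMA-7B derivatives; the field names, the prompt template, and the example values are illustrative assumptions, not DARWIN's published format.

```python
# Hypothetical instruction fine-tuning record for a materials property
# prediction task, assuming an Alpaca-style schema. The template and
# field names are assumptions, not the paper's published format.
PROMPT_TEMPLATE = (
    "Below is an instruction describing a materials task.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

record = {
    "instruction": "Predict the band gap (in eV) of the given material.",
    "input": "Composition: TiO2, rutile phase",
    "output": "3.0",  # illustrative target value
}

def build_example(rec: dict) -> str:
    """Concatenate the formatted prompt and the target into one training string."""
    return PROMPT_TEMPLATE.format(instruction=rec["instruction"], input=rec["input"]) + rec["output"]

if __name__ == "__main__":
    print(build_example(record))
```

Because every task is expressed in the same natural-language schema, records from different property datasets can be mixed in one training corpus, which is one plausible mechanism for the cross-task knowledge transfer described above.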
📝 Abstract
Materials discovery and design aim to find compositions and structures with desirable properties across highly complex and diverse physical spaces. Traditional solutions, such as high-throughput simulations and machine learning, often rely on complex descriptors, which hinder generalizability and transferability across different material systems. Moreover, these descriptors may inadequately represent macro-scale material properties, which are influenced by structural imperfections and compositional variations in real-world samples, limiting their practical applicability. To address these challenges, we propose DARWIN 1.5, the largest open-source large language model tailored for materials science. By leveraging natural language as input, DARWIN eliminates the need for task-specific descriptors and enables a flexible, unified approach to material property prediction and discovery. Our approach integrates 6M materials-domain papers and 21 experimental datasets covering 49,256 materials across modalities, while enabling cross-task knowledge transfer. The enhanced model achieves up to a 59.1% improvement in prediction accuracy over the base LLaMA-7B architecture and outperforms SOTA machine learning approaches across 8 materials design tasks. These results establish LLMs as a promising foundation for developing versatile and scalable models in materials science.
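The abstract's central claim is that a plain-text prompt replaces engineered descriptors at inference time. The sketch below shows how such natural-language prediction could be run with a LLaMA-style checkpoint via the Hugging Face transformers API, under stated assumptions: the model ID "darwin-1.5-7b" is a placeholder for the released weights, and the prompt follows the same illustrative Alpaca-style template as above, not a documented interface.

```python
# Hedged inference sketch: natural-language property prediction with a
# LLaMA-style checkpoint. "darwin-1.5-7b" is a placeholder model ID,
# not a real hub identifier; substitute the actual released weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "darwin-1.5-7b"  # hypothetical checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16)

# The task is posed entirely in natural language; no handcrafted
# descriptors are computed for the material.
prompt = (
    "### Instruction:\nPredict the band gap (in eV) of the given material.\n\n"
    "### Input:\nComposition: TiO2, rutile phase\n\n"
    "### Response:\n"
)

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=16, do_sample=False)

# Decode only the tokens generated after the prompt.
prediction = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(prediction)
```

Swapping the instruction and input strings is all that is needed to move between the 8 design tasks, which is the flexibility the descriptor-free formulation is meant to buy.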