🤖 AI Summary
To address the high computational cost of density functional theory (DFT) calculations and the labor-intensive feature engineering required for semiconductor band-gap prediction, this work proposes an end-to-end text-based regression method that leverages pretrained natural language processing (NLP) models. Specifically, it feeds structured material representations (compositional strings and ChatGPT-generated natural language descriptions) directly into a RoBERTa encoder followed by a lightweight linear regression head. Notably, this is the first study to adopt a general-purpose pretrained NLP model as an off-the-shelf material representation encoder, eliminating the need for domain-specific architectural design or handcrafted feature extraction. With minimal fine-tuning, the method achieves a mean absolute error (MAE) of approximately 0.33 eV, substantially outperforming support vector regression (SVR), random forests, and XGBoost; accuracy remains nearly unchanged even when the RoBERTa parameters are frozen and only the linear head is trained. These results demonstrate the strong cross-domain transferability and representational efficiency of large language models in materials science.
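The structured, text-based material representations mentioned above can be illustrated with a minimal sketch. The exact template and field names used by the authors are not specified here, so the keys below (`formula`, `crystal system`, `space group`) are illustrative assumptions only:

```python
# Hypothetical sketch: serializing tabular material features into a single
# formatted string that a language-model encoder can consume directly.
# The field names and template are assumptions, not the authors' exact format.

def material_to_text(features: dict) -> str:
    """Join feature key/value pairs into one 'key: value; ...' string."""
    parts = [f"{key}: {value}" for key, value in features.items()]
    return "; ".join(parts)

example = {
    "formula": "GaAs",
    "crystal system": "cubic",
    "space group": "F-43m",
}
print(material_to_text(example))
# → formula: GaAs; crystal system: cubic; space group: F-43m
```

A string like this can be tokenized and encoded as-is, which is what lets the approach skip the numerical feature-engineering step that shallow ML models require.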
📝 Abstract
In this study, we explore the use of a transformer-based language model as an encoder to predict the band gaps of semiconductor materials directly from their text descriptions. Quantum chemistry simulations, including those based on Density Functional Theory (DFT), are computationally intensive and time-consuming, which limits their practicality for high-throughput materials screening, particularly for complex systems. Shallow machine learning (ML) models, while effective, often require extensive data preprocessing to convert non-numerical material properties into numerical inputs. In contrast, our approach leverages textual data directly, bypassing the need for complex feature engineering. We generate material descriptions in two formats: formatted strings combining features, and natural language text generated using the ChatGPT API. We demonstrate that the RoBERTa model, pre-trained on general natural language corpora, performs effectively as an encoder for prediction tasks. With minimal fine-tuning, it achieves a mean absolute error (MAE) of approximately 0.33 eV, outperforming shallow machine learning models such as Support Vector Regression, Random Forest, and XGBoost. Even when only the linear regression head is trained while the RoBERTa encoder layers are kept frozen, the accuracy remains nearly identical to that of the fully fine-tuned model. This demonstrates that the pre-trained RoBERTa encoder is highly adaptable for processing domain-specific text related to material properties such as the band gap, significantly reducing the need for extensive retraining. This study highlights the potential of transformer-based language models to serve as efficient and versatile encoders for semiconductor material property prediction tasks.
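The frozen-encoder result has a simple interpretation: when the encoder's weights are never updated, each text maps to a fixed embedding vector, and training the linear regression head reduces to ordinary linear regression on those fixed features. The sketch below illustrates this under stated assumptions; random vectors stand in for frozen RoBERTa [CLS] embeddings (a real pipeline would obtain them from a library such as Hugging Face Transformers), and the head is fit in closed form with a small ridge penalty:

```python
import numpy as np

# Sketch of the frozen-encoder setup: with the encoder fixed, only the
# linear head has trainable parameters, so fitting it is plain (ridge)
# linear regression on precomputed embeddings. The random matrix X below
# is a stand-in for frozen RoBERTa [CLS] embeddings; y is a synthetic
# band-gap target, not real data.

rng = np.random.default_rng(0)
n_samples, dim = 200, 32

X = rng.normal(size=(n_samples, dim))                # stand-in embeddings
true_w = rng.normal(size=dim)                        # hypothetical mapping
y = X @ true_w + 0.05 * rng.normal(size=n_samples)   # synthetic targets (eV)

# Train only the linear head: closed-form ridge solution
# w = (X^T X + lam I)^{-1} X^T y.
lam = 1e-3
w = np.linalg.solve(X.T @ X + lam * np.eye(dim), X.T @ y)

mae = np.abs(X @ w - y).mean()
print(f"train MAE: {mae:.3f} eV")
```

Because the embeddings are computed once and reused, this setup avoids backpropagating through the encoder entirely, which is why freezing RoBERTa makes training so cheap while, per the abstract, costing almost no accuracy.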