Text to Band Gap: Pre-trained Language Models as Encoders for Semiconductor Band Gap Prediction

📅 2025-01-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost of density functional theory (DFT) calculations and the labor-intensive feature engineering involved in semiconductor band gap prediction, this work proposes an end-to-end text-based regression method that leverages pretrained natural language processing (NLP) models. Specifically, it feeds structured material representations (compositional strings and ChatGPT-generated natural language descriptions) directly into a RoBERTa encoder, followed by a lightweight linear regression head. Notably, this is the first study to adopt a general-purpose pretrained NLP model as an off-the-shelf material representation encoder, eliminating the need for domain-specific architectural design or handcrafted feature extraction. With the RoBERTa parameters frozen and only the linear head trained, the method achieves a mean absolute error (MAE) of 0.33 eV, substantially outperforming support vector regression (SVR), random forests, and XGBoost. These results demonstrate the strong cross-domain transferability and efficient representation capability of large language models in materials science.
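The "compositional strings" mentioned above can be sketched as a simple serialization of a material's formula and tabular features into one text string. The actual feature fields used in the paper are not listed here, so the keys below are illustrative placeholders, not the authors' schema:

```python
# Hypothetical sketch: serialize a material's composition and features
# into a single text string suitable for a language-model encoder.
# Field names ("crystal system", "density") are illustrative only.
def material_to_text(formula: str, features: dict) -> str:
    """Build a formatted-string description of one material."""
    parts = [f"Material: {formula}"]
    parts += [f"{name}: {value}" for name, value in features.items()]
    return "; ".join(parts)

desc = material_to_text("GaAs", {"crystal system": "cubic", "density": "5.32 g/cm3"})
# -> "Material: GaAs; crystal system: cubic; density: 5.32 g/cm3"
```

A string like this (or a ChatGPT-generated paraphrase of the same facts) is what the encoder consumes in place of hand-engineered numeric features.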

📝 Abstract
In this study, we explore the use of a transformer-based language model as an encoder to predict the band gaps of semiconductor materials directly from their text descriptions. Quantum chemistry simulations, including Density Functional Theory (DFT), are computationally intensive and time-consuming, which limits their practicality for high-throughput material screening, particularly for complex systems. Shallow machine learning (ML) models, while effective, often require extensive data preprocessing to convert non-numerical material properties into numerical inputs. In contrast, our approach leverages textual data directly, bypassing the need for complex feature engineering. We generate material descriptions in two formats: formatted strings combining features and natural language text generated using the ChatGPT API. We demonstrate that the RoBERTa model, pre-trained on natural language processing tasks, performs effectively as an encoder for prediction tasks. With minimal fine-tuning, it achieves a mean absolute error (MAE) of approximately 0.33 eV, performing better than shallow machine learning models such as Support Vector Regression, Random Forest, and XGBoost. Even when only the linear regression head is trained while keeping the RoBERTa encoder layers frozen, the accuracy remains nearly identical to that of the fully trained model. This demonstrates that the pre-trained RoBERTa encoder is highly adaptable for processing domain-specific text related to material properties, such as the band gap, significantly reducing the need for extensive retraining. This study highlights the potential of transformer-based language models to serve as efficient and versatile encoders for semiconductor materials property prediction tasks.
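The frozen-encoder setup described in the abstract (pretrained encoder with gradients disabled, plus a trainable linear regression head) can be sketched in PyTorch as follows. To keep the example self-contained and runnable, a toy `nn.Embedding` stands in for RoBERTa; with Hugging Face `transformers`, the encoder would instead be `RobertaModel.from_pretrained("roberta-base")` and the pooling would use its tokenizer output:

```python
import torch
import torch.nn as nn

class BandGapRegressor(nn.Module):
    """Frozen text encoder + trainable linear head, as in the paper's
    frozen-RoBERTa variant. The encoder here is a stand-in module."""
    def __init__(self, encoder: nn.Module, hidden_dim: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False          # freeze all encoder weights
        self.head = nn.Linear(hidden_dim, 1)  # predicts band gap in eV

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        emb = self.encoder(token_ids)         # (batch, seq, hidden)
        pooled = emb.mean(dim=1)              # simple mean pooling
        return self.head(pooled).squeeze(-1)  # (batch,)

# Toy stand-in encoder: an embedding table instead of RoBERTa.
vocab, hidden = 100, 16
model = BandGapRegressor(nn.Embedding(vocab, hidden), hidden)

# One illustrative training step on random "token" data;
# only the head's parameters receive gradient updates.
opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
tokens = torch.randint(0, vocab, (4, 10))
targets = torch.rand(4) * 5.0                      # fake band gaps in eV
loss = nn.functional.l1_loss(model(tokens), targets)  # MAE objective
loss.backward()
opt.step()
```

Because only the small head is optimized, training is cheap, which matches the paper's observation that the frozen-encoder variant nearly matches the fully fine-tuned model.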
Problem

Research questions and friction points this paper is trying to address.

Semiconductor Band Gap Prediction
Quantum Chemistry Simulation
Data Processing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pretrained Language Models
Semiconductor Band Gap Prediction
High Precision with Minimal Adjustment
Ying-Ting Yeh
Department of Chemical Engineering, Carnegie Mellon University, 5000 Forbes Street, Pittsburgh, PA 15213, USA
Janghoon Ock
Assistant Professor, University of Nebraska-Lincoln
Computational Catalysis · Material Discovery · AI4Science
A. Farimani
Department of Mechanical Engineering, Carnegie Mellon University, 5000 Forbes Street, Pittsburgh, PA 15213, USA