🤖 AI Summary
Current models for predicting catalytic adsorption-configuration energies (e.g., CatBERTa, GAP-CATBERTa) suffer from limited accuracy and poor configurational discrimination, undermining the reliability of machine learning–driven catalyst screening. To address this, we propose a deeply fused graph–language multimodal foundation model featuring a novel graph–text alignment mechanism that explicitly injects 3D geometric information into the language pathway. By integrating the Qwen large language model with the E(3)-equivariant graph transformer Equiformer-V2, the model jointly encodes atomic-scale 3D structures and structured textual representations, simultaneously supporting high-accuracy relaxed adsorption energy prediction and autoregressive CIF file generation. On the OC20 dataset, it achieves a mean absolute error (MAE) of 0.486 eV for relaxed adsorption-energy prediction, substantially outperforming existing baselines. This work establishes a new paradigm for inverse catalytic design grounded in unified multimodal representation learning.
📝 Abstract
Adsorption energy is a key descriptor of catalytic reactivity. It is fundamentally defined as the difference between the relaxed total energy of the adsorbate–surface system and that of an appropriate reference state; the accuracy of relaxed-energy prediction therefore directly determines the reliability of machine-learning-driven catalyst screening. E(3)-equivariant graph neural networks (GNNs) operate natively on three-dimensional atomic coordinates under periodic boundary conditions and have demonstrated strong performance on such tasks. Language-model-based approaches, by contrast, enable human-readable textual descriptions and reduce reliance on explicit graph construction, thereby broadening applicability; however, they remain insufficient both in adsorption-configuration energy prediction accuracy and in distinguishing "the same system with different configurations," even with graph-assisted pretraining in the style of GAP-CATBERTa.
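For concreteness, the definition above is conventionally written (standard notation, not spelled out in this abstract) as

$$ E_\mathrm{ads} = E_\mathrm{sys}^\mathrm{relaxed} - \left( E_\mathrm{slab} + E_\mathrm{adsorbate} \right), $$

where $E_\mathrm{sys}^\mathrm{relaxed}$ is the relaxed total energy of the combined adsorbate–surface system and the bracketed terms form the reference state (clean relaxed slab plus isolated gas-phase adsorbate).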
To this end, we propose QE-Catalytic, a multimodal framework that deeply couples a large language model (**Q**wen) with an E(3)-equivariant graph Transformer (**E**quiformer-V2), enabling unified support for adsorption-configuration property prediction and inverse design on complex catalytic surfaces. During prediction, QE-Catalytic jointly leverages three-dimensional structures and structured configuration text, and injects 3D geometric information into the language channel via graph–text alignment, allowing it to function as a high-performance text-based predictor when precise coordinates are unavailable; it can also autoregressively generate CIF files for target-energy-driven structure design and information completion. On OC20, QE-Catalytic reduces the MAE of relaxed adsorption energy from 0.713 eV to 0.486 eV, and consistently outperforms baselines such as CatBERTa and GAP-CATBERTa across multiple evaluation protocols.
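As an illustration only (the paper's actual alignment mechanism is not detailed in this abstract; the linear projection, prefix length, and dimensions below are hypothetical), the idea of injecting graph-derived 3D geometry into the language channel can be sketched as projecting a pooled graph embedding into the LLM's token-embedding space and prepending it as soft prefix tokens:

```python
import numpy as np

def inject_graph_prefix(graph_emb, text_tokens, W, n_prefix=4):
    """Hypothetical sketch of graph-text alignment: map a pooled graph
    embedding into the language model's embedding space and prepend it
    to the text token embeddings as soft prefix tokens.

    graph_emb:   (graph_dim,)      pooled output of the 3D graph encoder
    text_tokens: (T, llm_dim)      token embeddings of the configuration text
    W:           (graph_dim, n_prefix * llm_dim)  learned projection (random here)
    """
    llm_dim = text_tokens.shape[1]
    prefix = (graph_emb @ W).reshape(n_prefix, llm_dim)   # soft prefix tokens
    return np.concatenate([prefix, text_tokens], axis=0)  # (n_prefix + T, llm_dim)

rng = np.random.default_rng(0)
graph_emb = rng.standard_normal(256)           # e.g., from an Equiformer-V2-style encoder
text_tokens = rng.standard_normal((16, 1024))  # e.g., from the LLM's embedding layer
W = rng.standard_normal((256, 4 * 1024))
fused = inject_graph_prefix(graph_emb, text_tokens, W)
print(fused.shape)  # (20, 1024)
```

In this sketch the language model would then attend over the fused sequence, so textual tokens can condition on geometric information even when no explicit graph message passing occurs in the language pathway.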