🤖 AI Summary
Problem: Existing one-dimensional sequence representations of three-dimensional crystal structures are not invariant under SE(3) transformations or periodic translations, two properties needed for a physically meaningful and unique encoding.
Method: We propose Mat2Seq, a method that normalizes crystal geometry based on space-group symmetry and jointly encodes atomic numbers with relative coordinates. This yields a sequence representation that is provably SE(3)- and periodic-invariant: different mathematical descriptions of the same crystal map to the same sequence.
Contribution/Results: The resulting sequences can be fed directly to language models, enabling end-to-end sequence-based modeling and generation. Experiments report that Mat2Seq compares favorably with baselines, including CIF file streams, in structural validity, diversity, and inverse design accuracy, with promising results on crystal generation benchmarks.
📝 Abstract
We consider the problem of crystal material generation using language models (LMs). A key step is to convert 3D crystal structures into 1D sequences that LMs can process. Prior studies used the crystallographic information framework (CIF) file stream, which does not ensure SE(3) and periodic invariance and may not yield a unique sequence representation for a given crystal structure. Here, we propose a novel method, Mat2Seq, to tackle this challenge. Mat2Seq converts 3D crystal structures into 1D sequences and ensures that different mathematical descriptions of the same crystal are represented by a single unique sequence, thereby provably achieving SE(3) and periodic invariance. Experimental results show that, with language models, Mat2Seq achieves promising performance in crystal structure generation compared with prior methods.
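To make the invariance requirement concrete, the following toy sketch shows why a naive serialization is not unique and how a simple normalization removes some of the ambiguity. It is not the Mat2Seq algorithm: the function names (`lattice_parameters`, `rotate_z`, `toy_sequence`), the token layout, and the three-decimal rounding are all assumptions made for illustration. The sketch only covers rigid rotation of the lattice, integer lattice translations of atoms, and atom ordering. It does not handle alternative unit-cell choices or origin shifts, so it falls short of a full invariant encoding.

```python
import math


def lattice_parameters(lattice):
    """Return (a, b, c, alpha, beta, gamma) for three lattice row vectors.

    Lengths and angles (in degrees) do not change when the whole
    lattice is rotated, unlike the raw Cartesian vectors.
    """
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    def angle(u, v):
        cos = sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))
        return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

    a_vec, b_vec, c_vec = lattice
    return (norm(a_vec), norm(b_vec), norm(c_vec),
            angle(b_vec, c_vec), angle(a_vec, c_vec), angle(a_vec, b_vec))


def rotate_z(vec, degrees):
    """Rotate a 3D vector about the z-axis (helper for the demo only)."""
    t = math.radians(degrees)
    x, y, z = vec
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t),
            z)


def toy_sequence(lattice, atoms, decimals=3):
    """Serialize a crystal as a string (hypothetical token format).

    atoms is a list of (atomic_number, (fx, fy, fz)) with fractional
    coordinates expressed in the lattice basis.
    """
    params = " ".join(f"{p:.{decimals}f}" for p in lattice_parameters(lattice))
    wrapped = []
    for number, frac in atoms:
        # Wrap into [0, 1) so integer lattice translations disappear;
        # the second modulo maps a value that rounds up to 1.0 back to 0.0.
        coords = tuple(round(f % 1.0, decimals) % 1.0 for f in frac)
        wrapped.append((number, coords))
    wrapped.sort()  # fixed ordering removes dependence on input atom order
    tokens = [f"{n} " + " ".join(f"{f:.{decimals}f}" for f in c)
              for n, c in wrapped]
    return " | ".join([params] + tokens)


# Two descriptions of the same toy crystal (values chosen for illustration).
lattice = [(4.0, 0.0, 0.0), (0.0, 5.0, 0.0), (0.0, 0.0, 6.0)]
atoms = [(11, (0.0, 0.0, 0.0)), (17, (0.5, 0.5, 0.5))]

# Same crystal: lattice rotated by 30 degrees, atoms listed in a different
# order and shifted by whole lattice vectors. Fractional coordinates are
# unaffected by the rotation because they are defined relative to the lattice.
rotated_lattice = [rotate_z(v, 30.0) for v in lattice]
shifted_atoms = [(17, (1.5, -0.5, 2.5)), (11, (-1.0, 3.0, 0.0))]

seq_a = toy_sequence(lattice, atoms)
seq_b = toy_sequence(rotated_lattice, shifted_atoms)
print(seq_a)
print(seq_a == seq_b)
```

Writing the raw Cartesian lattice vectors and unwrapped coordinates into the string would instead give two different sequences for these inputs, which is the kind of non-uniqueness the abstract attributes to CIF file streams.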