Invariant Tokenization of Crystalline Materials for Language Model Enabled Generation

📅 2025-02-28
🏛️ Neural Information Processing Systems
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing one-dimensional sequence representations of three-dimensional crystal structures, such as CIF file streams, are not invariant under SE(3) transformations or under equivalent choices of the periodic description, so a single crystal can map to many different sequences. Method: The paper proposes Mat2Seq, a representation that normalizes the geometry based on space-group symmetry and jointly encodes atomic numbers with relative coordinates. The resulting sequence is SE(3) and periodic invariant: different mathematical descriptions of the same crystal are provably mapped to one unique sequence. Contribution/Results: The representation can be consumed directly by language models, enabling sequence-based modeling and generation of crystals. Experiments show that, with language models, Mat2Seq achieves promising performance in crystal structure generation compared with prior methods, including those based on CIF file streams.

📝 Abstract
We consider the problem of crystal materials generation using language models (LMs). A key step is to convert 3D crystal structures into 1D sequences to be processed by LMs. Prior studies used the crystallographic information framework (CIF) file stream, which fails to ensure SE(3) and periodic invariance and may not lead to unique sequence representations for a given crystal structure. Here, we propose a novel method, known as Mat2Seq, to tackle this challenge. Mat2Seq converts 3D crystal structures into 1D sequences and ensures that different mathematical descriptions of the same crystal are represented in a single unique sequence, thereby provably achieving SE(3) and periodic invariance. Experimental results show that, with language models, Mat2Seq achieves promising performance in crystal structure generation as compared with prior methods.

Problem

Research questions and friction points this paper is trying to address.

Convert 3D crystal structures into 1D sequences for language models.
Ensure SE(3) and periodic invariance in crystal structure representation.
Map different mathematical descriptions of the same crystal to a single, unique sequence.
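
The non-uniqueness problem listed above can be illustrated with a toy example. The sketch below is not the Mat2Seq algorithm: the function names (`naive_sequence`, `toy_canonical_sequence`) and the two-atom CsCl-type cell are assumptions made for illustration only. It shows that serializing atoms as given yields different strings for the same crystal, and that a simple normalization removes just two of the ambiguities (atom ordering and shifts by whole lattice vectors). It does not address rotations, origin shifts, or alternative unit-cell choices, which are the harder cases the paper targets.

```python
from fractions import Fraction as F


def naive_sequence(atoms):
    """Serialize atoms exactly as given, with no canonicalization."""
    return " ".join(f"{el} {x} {y} {z}" for el, (x, y, z) in atoms)


def toy_canonical_sequence(atoms):
    """Toy normalization (an assumption, not the paper's method).

    Wraps fractional coordinates into [0, 1) and sorts the atoms, which
    removes only atom-order and integer-lattice-shift ambiguities.
    """
    wrapped = [(el, tuple(c % 1 for c in frac)) for el, frac in atoms]
    return naive_sequence(sorted(wrapped))


# Two descriptions of the same two-atom cell in fractional coordinates.
# desc_b reorders the atoms and shifts each by whole lattice vectors.
desc_a = [("Cs", (F(0), F(0), F(0))), ("Cl", (F(1, 2), F(1, 2), F(1, 2)))]
desc_b = [("Cl", (F(3, 2), F(-1, 2), F(1, 2))), ("Cs", (F(1), F(0), F(-1)))]

print(naive_sequence(desc_a))          # Cs 0 0 0 Cl 1/2 1/2 1/2
print(naive_sequence(desc_b))          # Cl 3/2 -1/2 1/2 Cs 1 0 -1
print(toy_canonical_sequence(desc_a))  # Cl 1/2 1/2 1/2 Cs 0 0 0
print(toy_canonical_sequence(desc_b))  # Cl 1/2 1/2 1/2 Cs 0 0 0
```

Exact fractions are used instead of floats so that the comparison between the two normalized strings does not depend on rounding.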

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mat2Seq converts 3D crystal structures into 1D sequences that language models can process.
The sequences are provably SE(3) and periodic invariant.
Different mathematical descriptions of the same crystal map to a single unique sequence.
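
To make the notion of SE(3) invariance in the points above concrete, here is a minimal sketch of an invariance check on the lattice alone. It is not taken from the paper: the helper names (`lattice_parameters`, `rotate_z`) and the example hexagonal-type lattice are assumptions. It relies only on the standard fact that lattice lengths and inter-axial angles do not change when the lattice vectors are rigidly rotated, whereas the raw Cartesian vectors do.

```python
import math


def lattice_parameters(vectors):
    """Return (a, b, c, alpha, beta, gamma); angles are in degrees."""
    a, b, c = vectors

    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    def angle(u, v):
        cos = sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))
        # Clamp to guard against rounding slightly outside [-1, 1].
        return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

    # Convention: alpha = angle(b, c), beta = angle(a, c), gamma = angle(a, b).
    return (norm(a), norm(b), norm(c), angle(b, c), angle(a, c), angle(a, b))


def rotate_z(v, theta):
    """Rotate a 3D vector about the z axis by theta radians."""
    x, y, z = v
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return (x * cos_t - y * sin_t, x * sin_t + y * cos_t, z)


# Hypothetical hexagonal-type lattice: a = b = 3, c = 5, gamma = 120 degrees.
lattice = [(3.0, 0.0, 0.0), (-1.5, 1.5 * math.sqrt(3.0), 0.0), (0.0, 0.0, 5.0)]
rotated = [rotate_z(v, 0.7) for v in lattice]

params = lattice_parameters(lattice)
params_rotated = lattice_parameters(rotated)

# The Cartesian vectors differ, but the six lattice parameters agree.
same = all(math.isclose(p, q, abs_tol=1e-9) for p, q in zip(params, params_rotated))
print(same)
```

The check covers only a rotation about one axis and only the lattice. A full representation also has to handle atom positions and equivalent cell choices, as the points above state.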