All that structure matches does not glitter

📅 2025-09-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses three critical issues in evaluating inorganic crystal structure generation models: dataset duplication, improper data splitting, and misleading evaluation metrics. Methodologically, it (1) establishes a standardized dataset construction protocol featuring crystal structure deduplication and atom-number-stratified train/validation/test splits, explicitly identifying and removing identical structures expressed as non-identical unit cells; and (2) introduces two structure-aware metrics, METRe (Matching Error Rate) and cRMSE (normalized Crystal Root Mean Square Error), to correct known shortcomings of the conventional match rate metric, which can mislead when identical building blocks exhibit structural variety. Contributions include revised versions of the carbon-24 dataset and a polymorph-aware splitting protocol for perov-5, which collectively enhance the rigor, reproducibility, and scientific validity of benchmarking. These improvements lay a methodological foundation for trustworthy evaluation of generative models in materials discovery.
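The deduplication step described above can be illustrated with a minimal sketch. This is not the paper's protocol: a real pipeline would need a symmetry-aware structure matcher to catch identical structures expressed in different unit cells, whereas this toy version only collapses structures whose composition and rounded lattice parameters coincide. The `composition` and `lattice` dictionary keys are illustrative assumptions, not names from the paper or any dataset.

```python
from collections import OrderedDict


def structure_fingerprint(composition, lattice_params, ndigits=2):
    # Coarse fingerprint: sorted composition plus lattice parameters
    # (a, b, c, alpha, beta, gamma) rounded to `ndigits` decimals.
    # Illustrative only; it cannot detect symmetry-equivalent cells.
    comp_key = tuple(sorted(composition.items()))
    latt_key = tuple(round(x, ndigits) for x in lattice_params)
    return (comp_key, latt_key)


def deduplicate(structures):
    # Keep the first structure seen per fingerprint, preserving order.
    seen = OrderedDict()
    for s in structures:
        key = structure_fingerprint(s["composition"], s["lattice"])
        seen.setdefault(key, s)
    return list(seen.values())
```

Two near-identical graphite-like cells whose lattice parameters agree to two decimals would collapse into one entry, while a cell with a different composition or geometry survives.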

📝 Abstract
Generative models for materials, especially inorganic crystals, hold potential to transform the theoretical prediction of novel compounds and structures. Advancement in this field depends critically on robust benchmarks and minimal, information-rich datasets that enable meaningful model evaluation. This paper critically examines common datasets and reported metrics for a crystal structure prediction task: generating the most likely structures given the chemical composition of a material. We focus on three key issues: First, materials datasets should contain unique crystal structures; for example, we show that the widely-utilized carbon-24 dataset only contains ≈40% unique structures. Second, materials datasets should not be split randomly if polymorphs of many different compositions are numerous, which we find to be the case for the perov-5 dataset. Third, benchmarks can mislead if used uncritically, e.g., reporting a match rate metric without considering the structural variety exhibited by identical building blocks. To address these oft-overlooked issues, we introduce several fixes. We provide revised versions of the carbon-24 dataset: one with duplicates removed, one deduplicated and split by number of atoms N, and two containing only identical structures but with different unit cells. We also propose a new split for the perov-5 dataset which ensures polymorphs are grouped within each split subset, setting a more sensible standard for benchmarking model performance. Finally, we present METRe and cRMSE, new model evaluation metrics that can correct existing issues with the match rate metric.
Problem

Research questions and friction points this paper is trying to address.

Evaluating uniqueness and deduplication in crystal structure datasets
Addressing improper dataset splitting for polymorph-rich materials
Developing improved metrics for generative model evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deduplicate datasets to ensure unique crystal structures
Propose polymorph-aware dataset splits for benchmarking
Introduce new metrics METRe and cRMSE
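The polymorph-aware split listed above can be sketched as a group-level assignment: whole composition groups, rather than individual structures, are allotted to train/validation/test, so polymorphs of one composition never straddle a split boundary. This is a minimal sketch under assumed inputs (a list of dicts with a hashable `"composition"` field), not the paper's released split.

```python
import random
from collections import defaultdict


def polymorph_aware_split(entries, ratios=(0.8, 0.1, 0.1), seed=0):
    # Group entries by composition so all polymorphs stay together.
    groups = defaultdict(list)
    for e in entries:
        groups[e["composition"]].append(e)

    # Shuffle the group keys deterministically, then carve the key
    # list into train/val/test according to `ratios`.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_train = int(ratios[0] * len(keys))
    n_val = int(ratios[1] * len(keys))
    parts = (keys[:n_train],
             keys[n_train:n_train + n_val],
             keys[n_train + n_val:])

    # Flatten each key partition back into entry lists.
    return tuple([e for k in part for e in groups[k]] for part in parts)
```

Because the split operates on group keys, the ratios apply to the number of compositions rather than the number of structures; a production version might rebalance by structure count.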