AI Summary
This study addresses the limited ability of existing vision-language models to identify the Miller indices (HKL) of the strongest peak in powder X-ray diffraction (XRD) patterns and to carry out the crystallographic reasoning this requires. The authors introduce the first multimodal benchmark dataset for XRD peak indexing, comprising 250 samples derived from ten crystallographic databases. Models must jointly interpret a rendered XRD image and structured textual inputs (e.g., CIF files and chemical formulas) to localize the strongest peak and infer its complete set of HKL indices, which helps separate visual extraction errors from reasoning failures. Seven state-of-the-art models are evaluated using Jaccard similarity and exact-match accuracy, and the results reveal significant shortcomings: even the best-performing model (GPT-5.4) achieves a Jaccard score of only 0.5888 and an exact-match rate of 37.6%. The findings highlight weaknesses in quantitative scientific diagram understanding, including brittleness on double-peak patterns and over-prediction of HKLs.
Abstract
Miller-index identification from powder XRD patterns requires capabilities untested by existing multimodal benchmarks: the model must read a narrow peak location from a rendered scientific curve and then connect that observation to multi-step crystallographic reasoning. We introduce CrystalXRD-Bench, a 250-sample benchmark built from 10 public crystallographic databases for a single task: recover the full set of HKLs contributing to the highest-intensity peak in an XRD pattern. Each sample pairs the rendered XRD image with the source CIF text and chemical formula, so visual extraction errors and reasoning errors can be examined side by side. We evaluate seven vision-language models. The best Jaccard score is 0.5888 (GPT-5.4) with an exact-match rate of 37.6%, yet six of seven models remain below Jaccard 0.50; the task is far from solved. Error patterns vary systematically: double-peak cases are especially brittle, recall-heavy models gain coverage by over-predicting HKLs, and access to CIF text does not close the gap in crystallographic calculation. Alongside model rankings, the benchmark identifies the conditions under which current VLMs fail on quantitative scientific figures. All data and evaluation code will be publicly available.
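To make the two reported metrics concrete, the following is a minimal sketch of how Jaccard similarity and exact match could be computed over predicted and reference HKL sets. It is not the benchmark's released evaluation code. The function names (`jaccard`, `exact_match`, `score`) are hypothetical, and the sketch assumes HKLs are compared as literal integer triples with no canonicalization of symmetry-equivalent reflections, which the actual benchmark may handle differently.

```python
from typing import Iterable, List, Tuple

HKL = Tuple[int, int, int]


def jaccard(pred: Iterable[HKL], gold: Iterable[HKL]) -> float:
    """Size of the intersection over size of the union of two HKL sets."""
    p, g = set(pred), set(gold)
    if not p and not g:
        # Assumption: two empty sets are treated as a perfect match.
        return 1.0
    return len(p & g) / len(p | g)


def exact_match(pred: Iterable[HKL], gold: Iterable[HKL]) -> bool:
    """True only when the predicted set equals the reference set."""
    return set(pred) == set(gold)


def score(samples: List[Tuple[List[HKL], List[HKL]]]) -> Tuple[float, float]:
    """Return (mean Jaccard, exact-match rate) over (pred, gold) pairs."""
    if not samples:
        raise ValueError("need at least one sample")
    n = len(samples)
    mean_jaccard = sum(jaccard(p, g) for p, g in samples) / n
    em_rate = sum(exact_match(p, g) for p, g in samples) / n
    return mean_jaccard, em_rate


# Illustrative, made-up predictions: the second one over-predicts.
samples = [
    ([(1, 1, 1)], [(1, 1, 1)]),
    ([(1, 1, 1), (2, 0, 0)], [(1, 1, 1)]),
]
mean_jaccard, em_rate = score(samples)
```

In the second made-up sample the prediction contains the correct HKL plus one extra, so recall is perfect but the Jaccard score drops to 0.5 and the exact match fails. This is the mechanism by which over-predicting HKLs can raise coverage while still lowering both metrics.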