SCALAR: Quantifying Structural Hallucination, Consistency, and Reasoning Gaps in Materials Foundation Models

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing foundation models for materials lack systematic evaluation of structural hallucination, reasoning consistency, and generalization under geometric scale variation. This work introduces the SCALAR benchmark, which constructs cross-scale nanostructures via supercell expansion and geometric truncation and defines three tasks assessing crystal property prediction, physical reasoning, and inverse retrieval. For the first time, it links geometric scale generalization with structural hallucination and reasoning consistency, establishing a multidimensional evaluation framework covering hallucination rate, monotonic reasoning, and output validity. Experiments on DFT-validated data with chain-of-thought prompting show that while most models reduce hallucination and numeric error under explicit reasoning, they often sacrifice consistency or output validity, demonstrating that accuracy alone cannot capture scale generalization capability.
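The monotonic-reasoning metric mentioned above can be illustrated with a minimal sketch. The function name, the expected trend direction, and the example values are illustrative assumptions, not the paper's actual implementation: given a model's property predictions at increasing nanoparticle sizes, it reports the fraction of adjacent prediction pairs that break the expected monotonic trend.

```python
# Hypothetical sketch of a monotonic-reasoning check (not SCALAR's exact
# code): count adjacent prediction pairs that violate an expected trend.

def monotonicity_violation_rate(predictions, increasing=True):
    """Fraction of adjacent prediction pairs that break monotonicity."""
    if len(predictions) < 2:
        return 0.0
    violations = 0
    for a, b in zip(predictions, predictions[1:]):
        # A violation is a step against the expected direction.
        if (b < a) if increasing else (b > a):
            violations += 1
    return violations / (len(predictions) - 1)

# Example: a property expected to decrease with scale, with one reversal
# (the 2.9 -> 3.0 step), so 1 of 4 adjacent pairs violates the trend.
rate = monotonicity_violation_rate([3.2, 2.9, 3.0, 2.5, 2.4], increasing=False)
```

A rate of 0 means perfectly monotonic predictions across scales; under this reading, a model can have low numeric error while still scoring poorly here, which is the gap between accuracy and consistency the summary describes.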

📝 Abstract
Large language models are increasingly applied to materials science reasoning, yet their behavior under physically structured distribution shifts remains poorly understood. We introduce SCALAR (Structural Consistency And Logic Across Regimes), a benchmark for evaluating geometric scale generalization and its connection to structural hallucination, consistency, and reasoning in materials foundation models. Given canonical crystal representations, models must reason about derived nanoparticle structures obtained through supercell expansion and geometric truncation, across length scales from a few atoms to over 18,000 atoms, totaling $\approx$100,000 structures from DFT-validated unit cells. SCALAR defines three tasks: (i) CIF-to-property prediction; (ii) a Chain-of-Thought variant with explicit physics-grounded reasoning; and (iii) inverse retrieval, identifying crystals from candidates given target properties. Outputs are evaluated via structured metrics capturing numeric error, hallucination, cross-prompt consistency, monotonic reasoning, output validity, and retrieval regret. Experiments across diverse foundation models reveal large, model-dependent shifts under explicit reasoning, which often reduces hallucination and error but frequently destabilizes consistency or validity. These results demonstrate that geometric scale generalization cannot be inferred from accuracy alone. Supplementary materials are available at https://github.com/KurbanIntelligenceLab/SCALAR.
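The construction pipeline in the abstract (supercell expansion followed by geometric truncation) can be sketched in a few lines. This is a simplified assumption of the procedure for a cubic cell with an idealized spherical cutoff; the function name and example lattice are hypothetical, and the benchmark's actual unit cells are DFT-validated rather than toy lattices.

```python
import math

# Hypothetical sketch of SCALAR-style structure derivation: replicate a
# cubic unit cell into an n x n x n supercell, then keep only atoms inside
# a sphere centered on the supercell (geometric truncation).

def build_truncated_nanoparticle(frac_coords, a, n, radius):
    """Expand a cubic cell (lattice constant a) n times per axis, then
    truncate to a sphere of the given radius around the supercell center."""
    center = (n * a / 2.0,) * 3
    atoms = []
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for fx, fy, fz in frac_coords:
                    # Cartesian position of this atom in the supercell.
                    x, y, z = (i + fx) * a, (j + fy) * a, (k + fz) * a
                    if math.dist((x, y, z), center) <= radius:
                        atoms.append((x, y, z))
    return atoms

# Example: simple cubic cell with one atom at the origin, 4x4x4 expansion,
# truncated to a small sphere around the supercell center.
particle = build_truncated_nanoparticle([(0.0, 0.0, 0.0)], a=3.0, n=4, radius=4.0)
```

Sweeping `n` and `radius` is what yields structures spanning a few atoms to many thousands; the benchmark then asks models to reason about properties of these derived structures rather than the canonical unit cell alone.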
Problem

Research questions and friction points this paper is trying to address.

structural hallucination
geometric scale generalization
materials foundation models
reasoning consistency
crystal structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

geometric scale generalization
structural hallucination
materials foundation models
chain-of-thought reasoning
structured evaluation metrics