🤖 AI Summary
Materials science datasets have long been limited to atomic coordinates (e.g., XYZ files), hindering multimodal modeling and data-driven research. To address this, we introduce MultiCrystalSpectrumSet (MCS-Set), the first standardized multimodal materials benchmark integrating atomic structures, 2D projections, and structured textual annotations (e.g., lattice parameters, coordination numbers). Our contributions include: (1) a human-in-the-loop annotation framework that incorporates domain-expert knowledge to ensure high-quality labels; (2) a controllable crystal generation paradigm under partial cluster supervision; and (3) a cross-modal alignment training strategy evaluated with both large language models (LLMs) and vision-language models (VLMs). Experiments reveal significant performance disparities across modalities and show that annotation quality is critical to model generalization. The dataset and code are fully open-sourced to advance reproducible and scalable AI-driven materials research.
📝 Abstract
Most materials science datasets are limited to atomic geometries (e.g., XYZ files), restricting their utility for multimodal learning and comprehensive data-centric analysis. These constraints have historically impeded the adoption of advanced machine learning techniques in the field. This work introduces MultiCrystalSpectrumSet (MCS-Set), a curated framework that expands materials datasets by integrating atomic structures with 2D projections and structured textual annotations, including lattice parameters and coordination metrics. MCS-Set enables two key tasks: (1) multimodal property and summary prediction, and (2) constrained crystal generation with partial cluster supervision. Leveraging a human-in-the-loop pipeline, MCS-Set combines domain expertise with standardized descriptors for high-quality annotation. Evaluations using state-of-the-art language and vision-language models reveal substantial modality-specific performance gaps and highlight the importance of annotation quality for generalization. MCS-Set offers a foundation for benchmarking multimodal models, advancing annotation practices, and promoting accessible, versatile materials science datasets. The dataset and implementations are available at https://github.com/KurbanIntelligenceLab/MultiCrystalSpectrumSet.
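To make the three modalities concrete, the following minimal sketch shows how an XYZ geometry could be turned into a 2D projection and a small structured annotation. It is an illustration only and not the MCS-Set pipeline: the function names, the toy coordinates, the 2.5 distance cutoff, and the annotation fields are all assumptions chosen for this example, and periodic boundary conditions are ignored.

```python
import math

# Hypothetical toy cluster in XYZ format (coordinates invented for illustration).
XYZ_TEXT = """4
toy cluster, not taken from MCS-Set
Ti 0.0 0.0 0.0
O 2.0 0.0 0.0
O 0.0 2.0 0.0
O 0.0 0.0 2.0
"""

def parse_xyz(text):
    """Parse XYZ text into a list of (element, (x, y, z)) tuples."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])          # line 0: atom count, line 1: comment
    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, (float(x), float(y), float(z))))
    return atoms

def coordination_numbers(atoms, cutoff):
    """For each atom, count the other atoms closer than `cutoff`."""
    return [
        sum(1 for j, (_, pj) in enumerate(atoms)
            if i != j and math.dist(pi, pj) < cutoff)
        for i, (_, pi) in enumerate(atoms)
    ]

def project_xy(atoms):
    """Orthographic 2D projection: drop the z coordinate."""
    return [(symbol, (x, y)) for symbol, (x, y, _) in atoms]

atoms = parse_xyz(XYZ_TEXT)
coord = coordination_numbers(atoms, cutoff=2.5)   # assumed cutoff
projection = project_xy(atoms)

# A structured annotation with assumed field names.
annotation = {
    "n_atoms": len(atoms),
    "elements": sorted({symbol for symbol, _ in atoms}),
    "mean_coordination": sum(coord) / len(coord),
}
print(annotation)
```

In a periodic crystal the neighbour search would also need to account for lattice translations, which this sketch deliberately leaves out.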