🤖 AI Summary
This work addresses the limited ability of vision-language models (VLMs) to perceive and reason about spatial relationships among invisible microscopic entities, particularly molecules, despite the growing role of VLMs in scientific AI. Method: We introduce the concept of Microscopic Spatial Intelligence (MiSI) and propose MiSI-Bench, the first molecular-scale vision-language benchmark, comprising 163K question-answer pairs and 587K multi-view rendered images derived from approximately 4,000 molecular structures and covering nine scientific spatial reasoning tasks. The approach renders physically plausible images from 3D molecular structures, integrates spatial relation annotations, encodes scientific constraints, and fine-tunes a VLM. Contribution/Results: Experiments show that state-of-the-art VLMs perform substantially below human level overall. A fine-tuned 7B VLM surpasses human accuracy on spatial transformation tasks but lags significantly on knowledge-intensive tasks such as hydrogen-bond identification, which points to the need for explicit domain knowledge in microscopic spatial intelligence.
📝 Abstract
This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose a systematic benchmark framework, MiSI-Bench. The framework features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identifications. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans on spatial transformation tasks. Its poor performance on scientifically grounded tasks such as hydrogen bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at https://huggingface.co/datasets/zongzhao/MiSI-bench.