From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

📅 2025-12-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited ability of vision-language models (VLMs) to perceive and reason about spatial relationships among invisible microscopic entities, particularly molecules, despite their growing role in scientific AI. Method: We introduce Microscopic Spatial Intelligence (MiSI), a novel paradigm, and propose MiSI-Bench, the first molecular-scale vision-language benchmark, comprising 163K question-answer pairs and 587K multi-view rendered images and covering nine scientific spatial reasoning tasks. Our approach generates physically plausible images from 3D molecular structures, integrates spatial relation annotations, encodes scientific constraints, and performs VLM fine-tuning. Contribution/Results: Experiments reveal that state-of-the-art VLMs substantially underperform humans overall. A fine-tuned 7B VLM surpasses human accuracy on spatial transformation tasks but lags significantly on knowledge-intensive tasks such as hydrogen-bond identification, which underscores the need for explicit domain knowledge in microscopic spatial intelligence.

📝 Abstract

This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose a systematic benchmark framework MiSI-Bench. This framework features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identifications. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans in spatial transformation tasks, while its poor performance in scientifically-grounded tasks like hydrogen bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at https://huggingface.co/datasets/zongzhao/MiSI-bench.
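The comparison described above (model versus human performance across nine tasks) comes down to scoring answers per task. The following is a minimal sketch of such scoring, not the authors' evaluation code. The record fields `task`, `answer`, and `prediction`, the task labels, and the toy records are all hypothetical and are not taken from MiSI-Bench.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task: fraction of records whose prediction matches the answer}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        task = rec["task"]
        total[task] += 1
        # Compare case-insensitively after trimming whitespace, so "c " matches "C".
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


# Toy records for illustration only; field names and task labels are assumptions.
toy_records = [
    {"task": "rotation", "answer": "A", "prediction": "A"},
    {"task": "rotation", "answer": "C", "prediction": "c "},
    {"task": "hydrogen_bond", "answer": "B", "prediction": "D"},
    {"task": "hydrogen_bond", "answer": "A", "prediction": "A"},
]

scores = per_task_accuracy(toy_records)
for task, acc in scores.items():
    print(f"{task}: {acc:.2f}")
```

Reporting accuracy per task rather than as a single aggregate is what makes a split like the one in the abstract visible: a model can do well on transformation tasks while doing poorly on hydrogen bond recognition.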

Problem

Research questions and friction points this paper is trying to address.

Benchmarking VLMs on microscopic spatial intelligence tasks

Assessing VLMs' ability to perceive molecular spatial relationships

Evaluating VLMs on scientifically grounded reasoning tasks such as hydrogen bond recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark framework for microscopic spatial intelligence evaluation

Fine-tuned 7B model excels in spatial transformation tasks

Evidence that scientifically grounded tasks require integrating explicit domain knowledge

🔎 Similar Papers

3D-MolT5: Towards Unified 3D Molecule-Text Modeling with 3D Molecular Tokenization

2024-06-09 | arXiv.org | Citations: 1