MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph

📅 2026-05-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of multimodal large language models (MLLMs) in specialized domains such as microscopy, where performance is held back by scarce domain-specific data and the difficulty of encoding fine-grained expert knowledge in model parameters. The authors propose MicroWorld, a framework that builds a multimodal attributed property graph (MAPG) from scientific image-caption corpora and uses it to bring structured knowledge into microscopy reasoning. MicroWorld extracts entity-relation triples using either scispaCy or LLM-based triplet mining, aligns images and entities in a shared embedding space via Qwen3-VL-Embedding, and injects structured knowledge at inference time through graph-augmented retrieval and structured prompting, without requiring domain-specific fine-tuning. On MicroVQA, the approach improves Qwen3-VL-8B-Instruct by 37.5% and outperforms GPT-5 by 13.0%, which the authors report as a new state of the art; on MicroBench, it yields a 6.0% gain. The authors also report improved generalization in expert domains.
📝 Abstract
Multimodal large language models (MLLMs) show remarkable potential for scientific reasoning, yet their performance in specialized domains such as microscopy remains limited by the scarcity of domain-specific training data and the difficulty of encoding fine-grained expert knowledge into model parameters. To bridge this gap, we introduce MicroWorld, a framework that constructs a multimodal attributed property graph (MAPG) from large-scale scientific image-caption corpora and leverages it to augment MLLM reasoning at inference time without any domain-specific fine-tuning. MicroWorld extracts biomedical entities and relations via scispaCy or LLM-based triplet mining, aligns images and entities in a shared embedding space using Qwen3-VL-Embedding, and assembles a knowledge graph comprising approximately 111K nodes and 346K typed edges spanning eight relation categories. At inference time, a graph-augmented retrieval pipeline matches query entities to the MAPG and injects structured knowledge context into the MLLM prompt. On the MicroVQA benchmark, MicroWorld improves the reasoning performance of Qwen3-VL-8B-Instruct by 37.5% and outperforms GPT-5 by 13.0%, achieving a new state of the art. Furthermore, it yields a 6.0% performance gain on the MicroBench benchmark. Extensive experiments demonstrate the enhanced generalization capability introduced by MicroWorld. A qualitative case study further reveals both the mechanisms through which structured knowledge improves reasoning and the failure modes that point to promising future directions. Code and data are available at https://github.com/ieellee/MicroWorld.
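The inference-time flow described in the abstract (match query entities to the graph, then inject structured knowledge into the prompt) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class and function names (PropertyGraph, match_entities, build_prompt), the relation labels, and the three toy triples are all hypothetical, and plain substring matching stands in for the embedding-based alignment that the paper performs with Qwen3-VL-Embedding.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    """Toy attributed property graph (hypothetical, not the MAPG schema)."""
    nodes: dict = field(default_factory=dict)  # node name -> attribute dict
    edges: list = field(default_factory=list)  # (head, relation, tail) tuples

    def add_triple(self, head, relation, tail):
        # Register both endpoints as nodes, then store the typed edge.
        for name in (head, tail):
            self.nodes.setdefault(name, {"kind": "entity"})
        self.edges.append((head, relation, tail))

    def neighbors(self, name):
        # Every typed edge that touches the given node, in insertion order.
        return [e for e in self.edges if name in (e[0], e[2])]


def match_entities(question, graph):
    # ASSUMPTION: case-insensitive substring matching replaces the
    # embedding-similarity matching described in the abstract.
    text = question.lower()
    return sorted(n for n in graph.nodes if n.lower() in text)


def build_prompt(question, graph, max_facts=5):
    # Collect edges around matched entities and serialize them as context.
    facts = []
    for entity in match_entities(question, graph):
        for head, rel, tail in graph.neighbors(entity):
            fact = f"({head}) -[{rel}]-> ({tail})"
            if fact not in facts:
                facts.append(fact)
    context = "\n".join(facts[:max_facts]) or "(no matching knowledge)"
    return f"Knowledge context:\n{context}\n\nQuestion: {question}"


# Toy triples for illustration only; relation names are made up.
graph = PropertyGraph()
graph.add_triple("mitochondria", "located_in", "cytoplasm")
graph.add_triple("MitoTracker", "stains", "mitochondria")
graph.add_triple("actin", "part_of", "cytoskeleton")

print(build_prompt("Which dye labels mitochondria in this image?", graph))
```

The resulting string would be passed to the MLLM together with the image. Because the knowledge lives in the graph and the prompt rather than in model weights, this style of augmentation needs no fine-tuning, which is the property the abstract emphasizes.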
Problem

Research questions and friction points this paper is trying to address.

multimodal large language models
microscopy
domain gap
scientific reasoning
fine-grained knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models
Knowledge Graph
Domain Gap Bridging
Inference-time Augmentation
Scientific Reasoning
Manyu Li
Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
Ruian He
Fudan University
Image and Video Processing, Computer Vision, Multimodal Language Model
Chenxi Ma
Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
Weimin Tan
Fudan University
computer vision, deep learning, saliency detection, small object detection and recognition
Bo Yan
Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China