🤖 AI Summary
Biomedical microscopic image reasoning is hindered by the scarcity of high-quality multimodal training data. To address this, we propose HiCQA-Graph, a framework that, for the first time, constructs a heterogeneous graph over images, captions, and question-answer pairs, jointly leveraging natural-language-inference (NLI) textual entailment, CLIP-based image-text alignment, and agent signals for cross-modal consistency filtering. Building on an expert-literature curation pipeline of graph-based sample filtering and rigorous human verification, we obtain a large-scale, high-quality microscopic visual question answering (VQA) dataset whose test set is strictly human-validated and contains a substantially higher proportion of Bloom's-taxonomy hard instances than the MicroVQA benchmark. Trained on this data, a 4B-parameter open-source multimodal large language model (MLLM) reaches microscopy reasoning performance comparable to GPT-5, establishing new state-of-the-art results among open-source models.
📝 Abstract
Multimodal large language models (MLLMs) are increasingly applied to biomedical imaging, yet scientific reasoning for microscopy remains limited by the scarcity of large-scale, high-quality training data. We introduce MicroVQA++, a large-scale, high-quality microscopy VQA corpus derived from the BIOMEDICA archive through a three-stage pipeline. Stage one bootstraps supervision from expert-validated figure-caption pairs sourced from peer-reviewed articles. Stage two applies HiCQA-Graph, a novel heterogeneous graph over images, captions, and QA pairs that fuses NLI-based textual entailment, CLIP-based vision-language alignment, and agent signals to identify and filter inconsistent samples. Stage three uses an MLLM agent to generate multiple-choice questions (MCQs), followed by human screening. The resulting release comprises a large training split and a human-checked test split whose proportion of Bloom's-taxonomy hard samples exceeds that of the MicroVQA benchmark. Our work delivers (i) a quality-controlled dataset that couples expert literature with graph-based filtering and human refinement; (ii) HiCQA-Graph, the first graph to jointly model images, captions, and QA pairs for cross-modal consistency filtering; (iii) evidence that careful data construction enables 4B-scale MLLMs to reach microscopy reasoning performance comparable to GPT-5 and to achieve state-of-the-art results among open-source MLLMs. Code and dataset will be released after the review process concludes.
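The stage-two idea of fusing NLI entailment, CLIP alignment, and agent signals into a cross-modal consistency filter can be sketched as follows. This is a minimal illustration, not the paper's implementation: the scoring functions are word-overlap stubs standing in for real models (an NLI cross-encoder, a CLIP encoder, an MLLM agent), and every name, weight (0.4/0.4/0.2), and threshold (0.5) here is a hypothetical choice. The full HiCQA-Graph reasons over a heterogeneous graph; for brevity this sketch collapses that to independent scoring of each (image, caption, QA) triple.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    image_id: str   # identifier of the figure image
    caption: str    # expert-validated figure caption
    qa: str         # question + answer text for the sample

# --- Stub signals (stand-ins for real NLI / CLIP / agent models) ---

def nli_entailment(premise: str, hypothesis: str) -> float:
    """Stub entailment score: fraction of hypothesis words found in the premise.
    A real system would use an NLI cross-encoder over (caption, QA)."""
    p, h = set(premise.lower().split()), hypothesis.lower().split()
    return sum(w in p for w in h) / max(len(h), 1)

def clip_alignment(image_id: str, text: str) -> float:
    """Stub image-text alignment. A real system would embed the image and
    text with CLIP and take cosine similarity; here we pretend the image id
    encodes a keyword the text should mention."""
    return 1.0 if image_id.split("_")[0] in text.lower() else 0.2

def agent_consistency(sample: Sample) -> float:
    """Stub MLLM-agent check; here it only verifies the QA is non-empty."""
    return 1.0 if sample.qa.strip() else 0.0

def consistency_score(s: Sample, w=(0.4, 0.4, 0.2)) -> float:
    """Fuse the three signals into one score for the (image, caption, QA) triple."""
    return (w[0] * nli_entailment(s.caption, s.qa)
            + w[1] * clip_alignment(s.image_id, s.caption)
            + w[2] * agent_consistency(s))

def filter_samples(samples, threshold=0.5):
    """Keep only triples whose fused consistency score clears the threshold."""
    return [s for s in samples if consistency_score(s) >= threshold]
```

For example, a triple whose caption, image, and QA all mention the same structure scores high and is kept, while a triple with an off-topic caption falls below the threshold and is discarded; the released pipeline then passes the survivors to MCQ generation and human screening.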