From Compound Figures to Composite Understanding: Developing a Multi-Modal LLM from Biomedical Literature with Medical Multiple-Image Benchmarking and Validation

📅 2025-11-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing medical multimodal large language models (MLLMs) primarily support single-image understanding, which fails to meet clinical demands for integrating multi-phase or multimodal imaging data; progress is further hindered by the scarcity of high-quality, multi-image annotated datasets. Method: We propose a five-stage, context-aware instruction-generation framework that systematically mines 237K license-permissive compound figures and their textual contexts from the biomedical literature. Leveraging visual grounding, image-text alignment, and instruction synthesis, we train M3LLM, a medical multimodal LLM specialized for multi-image reasoning, and release PMC-MI-Bench, an expert-annotated benchmark for multi-image evaluation. Contribution/Results: M3LLM achieves state-of-the-art performance across multi-image comprehension, single-image reasoning, pure-text understanding, and multiple-choice tasks, outperforming both general-purpose and medical-specific MLLMs. It further demonstrates strong generalization on MIMIC chest X-ray sequence analysis, validating its clinical relevance and robustness.

📝 Abstract
Multi-modal large language models (MLLMs) have shown promise in advancing healthcare. However, most existing models remain confined to single-image understanding, which greatly limits their applicability in clinical workflows. In practice, medical diagnosis and the monitoring of disease progression often require synthesizing information across multiple images from different modalities or time points. The development of medical MLLMs capable of such multi-image understanding has been hindered by the lack of large-scale, high-quality annotated training data. To address this limitation, we propose a novel framework that leverages license-permissive compound figures in biomedical literature as a rich yet underutilized data source for multi-image analysis. Specifically, we design a five-stage, context-aware instruction generation paradigm underpinned by a divide-and-conquer strategy. By decomposing multi-image analysis into manageable sub-tasks, this paradigm empowers MLLMs to move beyond single-panel analysis and provide a composite understanding by learning the complex spatial, temporal, and cross-modal relationships inherent in these compound figures. By parsing over 237,000 compound figures and their contextual text for instruction generation, we develop M3LLM, a medical multi-image multi-modal large language model. For benchmarking, we construct PMC-MI-Bench for composite understanding, manually validated by medical experts. Extensive experiments show that M3LLM significantly outperforms both general-purpose and specialized medical MLLMs across multi-image, single-image, text-only, and multiple-choice scenarios. Notably, M3LLM exhibits strong generalization to longitudinal chest X-ray analysis using the MIMIC dataset. This work establishes a scalable and efficient paradigm for developing medical MLLMs capable of composite reasoning, bridging the gap between biomedical literature and real-world clinical applications.
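
This page does not include code, but the divide-and-conquer design described above can be made concrete with a minimal Python sketch. Everything here is hypothetical scaffolding: `segment_panels`, `call_llm`, and the stage ordering stand in for the paper's visual grounding, image-text alignment, and instruction-synthesis components, not the authors' actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Panel:
    label: str                 # sub-panel label, e.g. "A", "B"
    image_path: str
    aligned_text: str = ""     # caption/context matched to this panel

@dataclass
class CompoundFigure:
    figure_id: str
    image_path: str
    caption: str               # the figure caption from the article
    context: str               # in-text sentences citing the figure

def segment_panels(fig: CompoundFigure) -> list[Panel]:
    """Stage 1 (visual grounding): detect and crop sub-panels.
    Stand-in for a panel-segmentation model."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Stand-in for the LLM used in the text-side stages."""
    raise NotImplementedError

def generate_instructions(fig: CompoundFigure) -> list[dict]:
    """Divide-and-conquer: per-panel instructions first, then cross-panel ones."""
    panels = segment_panels(fig)                      # stage 1: split the figure
    for p in panels:                                  # stage 2: image-text alignment
        p.aligned_text = call_llm(
            f"From this caption and context, extract the text describing "
            f"panel {p.label}:\n{fig.caption}\n{fig.context}"
        )
    instructions = []
    for p in panels:                                  # stage 3: per-panel QA (divide)
        qa = call_llm(f"Write a question-answer pair about: {p.aligned_text}")
        instructions.append({"images": [p.image_path], "qa": qa})
    composite = call_llm(                             # stage 4: cross-panel QA (conquer)
        "Write a question that requires comparing all panels, and its answer:\n"
        + "\n".join(p.aligned_text for p in panels)
    )
    instructions.append({"images": [p.image_path for p in panels],
                         "qa": composite})
    # stage 5: filter/validate the synthesized instructions
    return [ins for ins in instructions if ins["qa"]]
```

The divide step yields single-panel instructions, while the conquer step composes a comparison question spanning all panels, which is the composite-understanding behavior the abstract describes.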
Problem

Research questions and friction points this paper is trying to address.

Develops a medical multi-modal LLM for multi-image composite understanding
Addresses the lack of annotated training data for medical multi-image analysis
Enables composite reasoning across spatial, temporal, and cross-modal medical images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages compound figures from biomedical literature for training
Uses a five-stage, context-aware instruction generation paradigm
Develops the M3LLM model for multi-image composite understanding (see the illustrative sketch after this list)
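
As a companion sketch, the snippet below shows one plausible way to assemble the kind of multi-image query that PMC-MI-Bench items or the MIMIC longitudinal experiments would pose. The interleaved message schema is an assumption following a common chat-MLLM convention; it is not M3LLM's documented API.

```python
def build_multi_image_query(image_paths: list[str], question: str) -> dict:
    """Interleave one image placeholder per input image with the question.
    The message schema here is assumed, not taken from the paper."""
    content = []
    for i, path in enumerate(image_paths, start=1):
        content.append({"type": "image", "path": path})
        content.append({"type": "text", "text": f"(Image {i})"})
    content.append({"type": "text", "text": question})
    return {"role": "user", "content": content}

# Example: a longitudinal chest X-ray comparison, mirroring the MIMIC validation.
query = build_multi_image_query(
    ["cxr_2024-01-10.png", "cxr_2024-03-02.png"],
    "Compare the two studies and describe any interval change in the lungs.",
)
```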
Zhen Chen
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT 06510, USA
Yihang Fu
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT 06510, USA
Gabriel Madera
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT 06510, USA; School of Medicine, University of Puerto Rico, San Juan, PR 00921, USA
Mauro Giuffre
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT 06510, USA
Serina Applebaum
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT 06510, USA
Hyunjae Kim
Yale University
Natural Language Processing · Biomedical Informatics · Healthcare
Hua Xu
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT 06510, USA
Qingyu Chen
Biomedical Informatics & Data Science, Yale University; NCBI-NLM, National Institutes of Health
Text mining · Machine learning · Data curation · BioNLP · Medical Imaging Analysis