From Compound Figures to Composite Understanding: Developing a Multi-Modal LLM from Biomedical Literature with Medical Multiple-Image Benchmarking and Validation

📅 2025-11-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing medical multimodal large language models (MLLMs) primarily support single-image understanding, which fails to meet clinical demands for integrating multi-phase or multimodal imaging data; progress is further hindered by the scarcity of high-quality, multi-image annotated datasets. Method: We propose a five-stage, context-aware instruction-generation framework that systematically mines 237K license-permissive compound figures and their textual contexts from the biomedical literature. Leveraging visual grounding, image-text alignment, and instruction synthesis, we train M3LLM, a medical multimodal LLM specialized for multi-image reasoning, and release PMC-MI-Bench, an expert-annotated benchmark for multi-image evaluation. Contribution/Results: M3LLM achieves state-of-the-art performance across multi-image comprehension, single-image reasoning, pure-text understanding, and multiple-choice tasks, outperforming both general-purpose and medical-specific MLLMs. It further demonstrates strong generalization on MIMIC chest X-ray sequence analysis, validating its clinical relevance and robustness.

📝 Abstract
Multi-modal large language models (MLLMs) have shown promise in advancing healthcare. However, most existing models remain confined to single-image understanding, which greatly limits their applicability in clinical workflows. In practice, medical diagnosis and the monitoring of disease progression often require synthesizing information across multiple images from different modalities or time points. The development of medical MLLMs capable of such multi-image understanding has been hindered by the lack of large-scale, high-quality annotated training data. To address this limitation, we propose a novel framework that leverages license-permissive compound figures in biomedical literature as a rich yet underutilized data source for multi-image analysis. Specifically, we design a five-stage, context-aware instruction generation paradigm underpinned by a divide-and-conquer strategy. By decomposing multi-image analysis into manageable sub-tasks, this paradigm empowers MLLMs to move beyond single-panel analysis and provide a composite understanding by learning the complex spatial, temporal, and cross-modal relationships inherent in these compound figures. By parsing over 237,000 compound figures and their contextual text for instruction generation, we develop M3LLM, a medical multi-image multi-modal large language model. For benchmarking, we construct PMC-MI-Bench for composite understanding, manually validated by medical experts. Extensive experiments show that M3LLM significantly outperforms both general-purpose and specialized medical MLLMs across multi-image, single-image, text-only, and multiple-choice scenarios. Notably, M3LLM exhibits strong generalization to longitudinal chest X-ray analysis using the MIMIC dataset. This work establishes a scalable and efficient paradigm for developing medical MLLMs capable of composite reasoning, bridging the gap between biomedical literature and real-world clinical applications.
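
This page does not include code, but the divide-and-conquer design described above can be made concrete with a minimal Python sketch. Everything here is hypothetical scaffolding: `segment_panels`, `call_llm`, and the stage ordering stand in for the paper's visual grounding, image-text alignment, and instruction-synthesis components, not the authors' actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Panel:
    label: str                 # sub-panel label, e.g. "A", "B"
    image_path: str
    aligned_text: str = ""     # caption/context matched to this panel

@dataclass
class CompoundFigure:
    figure_id: str
    image_path: str
    caption: str               # the figure caption from the article
    context: str               # in-text sentences citing the figure

def segment_panels(fig: CompoundFigure) -> list[Panel]:
    """Stage 1 (visual grounding): detect and crop sub-panels.
    Stand-in for a panel-segmentation model."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Stand-in for the LLM used in the text-side stages."""
    raise NotImplementedError

def generate_instructions(fig: CompoundFigure) -> list[dict]:
    """Divide-and-conquer: per-panel instructions first, then cross-panel ones."""
    panels = segment_panels(fig)                      # stage 1: split the figure
    for p in panels:                                  # stage 2: image-text alignment
        p.aligned_text = call_llm(
            f"From this caption and context, extract the text describing "
            f"panel {p.label}:\n{fig.caption}\n{fig.context}"
        )
    instructions = []
    for p in panels:                                  # stage 3: per-panel QA (divide)
        qa = call_llm(f"Write a question-answer pair about: {p.aligned_text}")
        instructions.append({"images": [p.image_path], "qa": qa})
    composite = call_llm(                             # stage 4: cross-panel QA (conquer)
        "Write a question that requires comparing all panels, and its answer:\n"
        + "\n".join(p.aligned_text for p in panels)
    )
    instructions.append({"images": [p.image_path for p in panels],
                         "qa": composite})
    # stage 5: filter/validate the synthesized instructions
    return [ins for ins in instructions if ins["qa"]]
```

The divide step yields single-panel instructions, while the conquer step composes a comparison question spanning all panels, which is the composite-understanding behavior the abstract describes.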
Problem

Research questions and friction points this paper is trying to address.

Develops a medical multi-modal LLM for multi-image composite understanding
Addresses the lack of annotated training data for medical multi-image analysis
Enables composite reasoning across spatial, temporal, and cross-modal medical images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages compound figures from biomedical literature for training
Uses a five-stage, context-aware instruction generation paradigm
Develops the M3LLM model for multi-image composite understanding (see the illustrative sketch after this list)
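
As a companion sketch, the snippet below shows one plausible way to assemble the kind of multi-image query that PMC-MI-Bench items or the MIMIC longitudinal experiments would pose. The interleaved message schema is an assumption following a common chat-MLLM convention; it is not M3LLM's documented API.

```python
def build_multi_image_query(image_paths: list[str], question: str) -> dict:
    """Interleave one image placeholder per input image with the question.
    The message schema here is assumed, not taken from the paper."""
    content = []
    for i, path in enumerate(image_paths, start=1):
        content.append({"type": "image", "path": path})
        content.append({"type": "text", "text": f"(Image {i})"})
    content.append({"type": "text", "text": question})
    return {"role": "user", "content": content}

# Example: a longitudinal chest X-ray comparison, mirroring the MIMIC validation.
query = build_multi_image_query(
    ["cxr_2024-01-10.png", "cxr_2024-03-02.png"],
    "Compare the two studies and describe any interval change in the lungs.",
)
```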
Zhen Chen
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT 06510, USA
Yihang Fu
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT 06510, USA
Gabriel Madera
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT 06510, USA; School of Medicine, University of Puerto Rico, San Juan, PR 00921, USA
Mauro Giuffre
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT 06510, USA
Serina Applebaum
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT 06510, USA
Hyunjae Kim
Yale University
Natural Language Processing · Biomedical Informatics · Healthcare
Hua Xu
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT 06510, USA
Qingyu Chen
Biomedical Informatics & Data Science, Yale University; NCBI-NLM, National Institutes of Health
Text mining · Machine learning · Data curation · BioNLP · Medical Imaging Analysis