MedM-VL: What Makes a Good Medical LVLM?

📅 2025-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the growing demand for complex, multimodal medical image analysis beyond single-task paradigms, this work proposes a systematic framework for building clinically scalable Large Vision-Language Models (LVLMs) for medicine. Methodologically, it extends the LLaVA paradigm by integrating domain-specific vision encoders (medical ViT/CNN), cross-modal adapters, and open-source LLMs (e.g., Llama), supporting multi-resolution 2D/3D inputs—including chest CT—and enabling domain-adaptive fine-tuning. Key contributions include: (1) the first comprehensive set of architectural design principles for medical LVLMs; (2) the release of two task-specialized bilingual models—MedM-VL-2D and MedM-VL-CT-Chest; and (3) the open-sourcing of the first modular, fine-tunable medical vision-language foundation model codebase. Evaluated on medical visual question answering and radiology report generation, the models achieve state-of-the-art or near-state-of-the-art performance.
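To make the encoder-connector-LLM paradigm concrete, below is a minimal PyTorch sketch, assuming a ViT-style vision encoder that emits patch features and a Hugging Face-style LLM that accepts `inputs_embeds`. All module names and dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class EncoderConnectorLLM(nn.Module):
    """Minimal encoder-connector-LLM wiring (illustrative only)."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g., a medical ViT or CNN
        # Cross-modal adapter: a small MLP projecting patch features
        # into the LLM's token-embedding space.
        self.connector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm  # e.g., an open-source Llama-family model

    def forward(self, images, text_embeds):
        patch_feats = self.vision_encoder(images)    # (B, N_patches, vision_dim)
        vision_tokens = self.connector(patch_feats)  # (B, N_patches, llm_dim)
        # Prepend the projected vision tokens to the text embeddings so the
        # LLM attends over one joint multimodal sequence.
        inputs = torch.cat([vision_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```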

📝 Abstract
Medical image analysis is a fundamental component of clinical practice. As deep learning progresses, the focus has shifted from single-task applications, such as classification and segmentation, to more complex multimodal tasks, including medical visual question answering and report generation. Traditional shallow and task-specific models are increasingly limited in addressing the complexity and scalability required in clinical practice. The emergence of large language models (LLMs) has driven the development of medical Large Vision-Language Models (LVLMs), offering a unified solution for diverse vision-language tasks. In this study, we investigate various architectural designs for medical LVLMs based on the widely adopted LLaVA framework, which follows an encoder-connector-LLM paradigm. We construct two distinct models targeting 2D and 3D modalities, respectively. These models are designed to support both general-purpose medical tasks and domain-specific fine-tuning, thereby serving as effective foundation models. To facilitate reproducibility and further research, we develop a modular and extensible codebase, MedM-VL, and release two LVLM variants: MedM-VL-2D for 2D medical image analysis and MedM-VL-CT-Chest for 3D CT-based applications. The code and models are available at: https://github.com/MSIIP/MedM-VL
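The abstract positions these models as foundation models for domain-specific fine-tuning. The actual recipe lives in the linked repository; as a rough illustration, here is a hedged sketch of parameter-efficient fine-tuning of the LLM backbone with LoRA via Hugging Face PEFT. The backbone name, rank, and target modules are assumptions, not MedM-VL's documented configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Assumed LLM backbone; MedM-VL's actual checkpoints live in the linked repo.
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_cfg = LoraConfig(
    r=16,                                 # low-rank adapter dimension
    lora_alpha=32,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_cfg)  # freezes base weights, injects adapters
llm.print_trainable_parameters()     # typically well under 1% of parameters
```

Freezing the backbone and training only low-rank adapters keeps domain adaptation cheap enough to repeat per clinical task, which is what makes the foundation-model framing practical.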
Problem

Research questions and friction points this paper is trying to address.

Developing medical LVLMs that unify multimodal tasks such as visual question answering and report generation
Overcoming the limited complexity and scalability of traditional task-specific models
Creating foundation models that cover both 2D and 3D medical imaging modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Builds medical LVLMs on the LLaVA encoder-connector-LLM framework
Trains separate models for 2D and 3D modalities (see the sketch after this list)
Releases MedM-VL, a modular and extensible codebase for medical vision-language tasks
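To illustrate why 2D and 3D modalities get separate models, the sketch below contrasts a 2D patch stem with a 3D one: a CT volume tokenizes into many more patches per study, so the 3D variant needs its own encoder and token budget. Shapes, channel counts, and patch sizes here are assumptions for illustration, not MedM-VL's configuration.

```python
import torch
import torch.nn as nn

embed_dim = 1024

# 2D stem: one (H, W) image -> a sequence of patch tokens.
patchify_2d = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
img = torch.randn(1, 3, 224, 224)
tokens_2d = patchify_2d(img).flatten(2).transpose(1, 2)   # (1, 196, 1024)

# 3D stem: a (D, H, W) CT volume -> roughly 8x the tokens in this setup,
# hence a separate model for volumetric inputs like chest CT.
patchify_3d = nn.Conv3d(1, embed_dim, kernel_size=(4, 16, 16), stride=(4, 16, 16))
ct = torch.randn(1, 1, 32, 224, 224)
tokens_3d = patchify_3d(ct).flatten(2).transpose(1, 2)    # (1, 1568, 1024)
```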
Yiming Shi
University of Electronic Science and Technology of China
Efficient AI, Parameter-Efficient Fine-Tuning, Diffusion, Multimodal
Shaoshuai Yang
Department of Electronic Engineering, Tsinghua University, Beijing, China
Xun Zhu
Department of Electronic Engineering, Tsinghua University, Beijing, China
Haoyu Wang
Department of Electronic Engineering, Tsinghua University, Beijing, China
Miao Li
Department of Electronic Engineering, Tsinghua University, Beijing, China
Ji Wu
Tsinghua University
Artificial Intelligence, Smart Healthcare, Machine Learning, Pattern Recognition, Speech Recognition