🤖 AI Summary
To address the growing demand for complex, multimodal medical image analysis beyond single-task paradigms, this work proposes a systematic framework for building clinically scalable Large Vision-Language Models (LVLMs) for medicine. Methodologically, it extends the LLaVA paradigm by integrating domain-specific vision encoders (medical ViT/CNN), cross-modal adapters, and open-source LLMs (e.g., Llama), supporting multi-resolution 2D/3D inputs—including chest CT—and enabling domain-adaptive fine-tuning. Key contributions include: (1) the first comprehensive set of architectural design principles for medical LVLMs; (2) the release of two task-specialized bilingual models—MedM-VL-2D and MedM-VL-CT-Chest; and (3) the open-sourcing of the first modular, fine-tunable medical vision-language foundation model codebase. Evaluated on medical visual question answering and radiology report generation, the models achieve state-of-the-art or near-state-of-the-art performance.
📝 Abstract
Medical image analysis is a fundamental component of modern clinical practice. As deep learning progresses, the focus has shifted from single-task applications, such as classification and segmentation, to more complex multimodal tasks, including medical visual question answering and report generation. Traditional shallow and task-specific models are increasingly limited in addressing the complexity and scalability required in clinical practice. The emergence of large language models (LLMs) has driven the development of medical Large Vision-Language Models (LVLMs), offering a unified solution for diverse vision-language tasks. In this study, we investigate various architectural designs for medical LVLMs based on the widely adopted LLaVA framework, which follows an encoder-connector-LLM paradigm. We construct two distinct models targeting 2D and 3D modalities, respectively. These models are designed to support both general-purpose medical tasks and domain-specific fine-tuning, thereby serving as effective foundation models. To facilitate reproducibility and further research, we develop a modular and extensible codebase, MedM-VL, and release two LVLM variants: MedM-VL-2D for 2D medical image analysis and MedM-VL-CT-Chest for 3D CT-based applications. The code and models are available at: https://github.com/MSIIP/MedM-VL
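The encoder-connector-LLM paradigm mentioned above can be sketched at the shape level as follows. This is a minimal illustration using NumPy stand-ins, not the actual MedM-VL implementation: all dimensions, class names, and the use of a two-layer MLP connector are illustrative assumptions in the style of LLaVA-like models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not MedM-VL's actual configuration)
NUM_PATCHES = 196   # e.g. a ViT with a 14x14 patch grid on a 224x224 image
VISION_DIM = 768    # vision encoder hidden size
LLM_DIM = 4096      # LLM token-embedding size

def vision_encoder(image: np.ndarray) -> np.ndarray:
    """Stand-in for a domain-specific medical ViT/CNN:
    maps an image to a sequence of patch embeddings."""
    return rng.standard_normal((NUM_PATCHES, VISION_DIM))

class MLPConnector:
    """Two-layer MLP projector: the 'connector' that maps vision features
    into the LLM's embedding space (LLaVA-style; details are assumptions)."""
    def __init__(self, in_dim: int, out_dim: int):
        self.w1 = rng.standard_normal((in_dim, out_dim)) * 0.01
        self.w2 = rng.standard_normal((out_dim, out_dim)) * 0.01

    def __call__(self, x: np.ndarray) -> np.ndarray:
        h = np.maximum(x @ self.w1, 0.0)  # ReLU here; GELU is common in practice
        return h @ self.w2

def build_llm_input(image: np.ndarray, text_embeds: np.ndarray) -> np.ndarray:
    """Prepend projected visual tokens to the text token embeddings,
    forming the multimodal sequence that the LLM consumes."""
    connector = MLPConnector(VISION_DIM, LLM_DIM)
    visual_tokens = connector(vision_encoder(image))   # (NUM_PATCHES, LLM_DIM)
    return np.concatenate([visual_tokens, text_embeds], axis=0)

image = np.zeros((3, 224, 224))                        # dummy 2D input
text_embeds = rng.standard_normal((32, LLM_DIM))       # 32 text tokens
seq = build_llm_input(image, text_embeds)
print(seq.shape)                                       # (228, 4096)
```

For 3D inputs such as chest CT, the same structure applies, with the vision encoder producing volumetric patch tokens instead of 2D patch tokens; only the encoder and token count change, while the connector and LLM interface stay the same.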