🤖 AI Summary
To address the growing demand for complex, multimodal medical image analysis beyond single-task paradigms, this work proposes a systematic framework for building clinically scalable Large Vision-Language Models (LVLMs) for medicine. Methodologically, it extends the LLaVA paradigm by integrating domain-specific vision encoders (medical ViT/CNN), cross-modal adapters, and open-source LLMs (e.g., Llama), supporting multi-resolution 2D/3D inputs—including chest CT—and enabling domain-adaptive fine-tuning. Key contributions include: (1) the first comprehensive set of architectural design principles for medical LVLMs; (2) the release of two task-specialized bilingual models—MedM-VL-2D and MedM-VL-CT-Chest; and (3) the open-sourcing of the first modular, fine-tunable medical vision-language foundation model codebase. Evaluated on medical visual question answering and radiology report generation, the models achieve state-of-the-art or near-state-of-the-art performance.
📝 Abstract
Medical image analysis is a fundamental component of modern clinical practice. As deep learning progresses, the focus has shifted from single-task applications, such as classification and segmentation, to more complex multimodal tasks, including medical visual question answering and report generation. Traditional shallow and task-specific models are increasingly limited in addressing the complexity and scalability required in clinical practice. The emergence of large language models (LLMs) has driven the development of medical Large Vision-Language Models (LVLMs), offering a unified solution for diverse vision-language tasks. In this study, we investigate various architectural designs for medical LVLMs based on the widely adopted LLaVA framework, which follows an encoder-connector-LLM paradigm. We construct two distinct models targeting 2D and 3D modalities, respectively. These models are designed to support both general-purpose medical tasks and domain-specific fine-tuning, thereby serving as effective foundation models. To facilitate reproducibility and further research, we develop a modular and extensible codebase, MedM-VL, and release two LVLM variants: MedM-VL-2D for 2D medical image analysis and MedM-VL-CT-Chest for 3D CT-based applications. The code and models are available at: https://github.com/MSIIP/MedM-VL
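The encoder-connector-LLM paradigm mentioned above can be sketched at the shape level as follows. This is a minimal illustration using NumPy stand-ins, not the actual MedM-VL implementation: all dimensions, class names, and the use of a two-layer MLP connector are illustrative assumptions in the style of LLaVA-like models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not MedM-VL's actual configuration)
NUM_PATCHES = 196   # e.g. a ViT with a 14x14 patch grid on a 224x224 image
VISION_DIM = 768    # vision encoder hidden size
LLM_DIM = 4096      # LLM token-embedding size

def vision_encoder(image: np.ndarray) -> np.ndarray:
    """Stand-in for a domain-specific medical ViT/CNN:
    maps an image to a sequence of patch embeddings."""
    return rng.standard_normal((NUM_PATCHES, VISION_DIM))

class MLPConnector:
    """Two-layer MLP projector: the 'connector' that maps vision features
    into the LLM's embedding space (LLaVA-style; details are assumptions)."""
    def __init__(self, in_dim: int, out_dim: int):
        self.w1 = rng.standard_normal((in_dim, out_dim)) * 0.01
        self.w2 = rng.standard_normal((out_dim, out_dim)) * 0.01

    def __call__(self, x: np.ndarray) -> np.ndarray:
        h = np.maximum(x @ self.w1, 0.0)  # ReLU here; GELU is common in practice
        return h @ self.w2

def build_llm_input(image: np.ndarray, text_embeds: np.ndarray) -> np.ndarray:
    """Prepend projected visual tokens to the text token embeddings,
    forming the multimodal sequence that the LLM consumes."""
    connector = MLPConnector(VISION_DIM, LLM_DIM)
    visual_tokens = connector(vision_encoder(image))   # (NUM_PATCHES, LLM_DIM)
    return np.concatenate([visual_tokens, text_embeds], axis=0)

image = np.zeros((3, 224, 224))                        # dummy 2D input
text_embeds = rng.standard_normal((32, LLM_DIM))       # 32 text tokens
seq = build_llm_input(image, text_embeds)
print(seq.shape)                                       # (228, 4096)
```

For 3D inputs such as chest CT, the same structure applies, with the vision encoder producing volumetric patch tokens instead of 2D patch tokens; only the encoder and token count change, while the connector and LLM interface stay the same.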