🤖 AI Summary
Medical visual question answering (VQA) faces three obstacles: heterogeneous imaging modalities (e.g., X-ray, CT, MRI) that are difficult to interpret jointly, model bloat, and limited clinical deployability. To address these, we propose MedVQA-LLM, a lightweight multimodal large language model. Our approach integrates the BiomedCLIP image encoder with the LLaMA-3 language model, enhanced by cross-modal alignment fine-tuning and parameter-efficient adaptation, substantially reducing computational overhead. On the OmniMedVQA benchmark, MedVQA-LLM achieves state-of-the-art open-ended VQA accuracy of 73.4%. It runs efficiently on just two 40 GB A100 GPUs, with approximately 8 billion parameters and 3.2× faster inference than billion-parameter baselines. To our knowledge, this is the first work to jointly achieve high accuracy and low-resource efficiency in medical VQA, enabling practical edge deployment in clinical settings.
📝 Abstract
Medical Visual Question Answering (VQA) enhances clinical decision-making by enabling systems to interpret medical images and answer clinical queries. However, developing efficient, high-performance VQA models is challenging due to the complexity of medical imagery and the diversity of imaging modalities. In this paper, we introduce a lightweight multimodal VQA model that integrates BiomedCLIP for image feature extraction with LLaMA-3 for text processing. Designed for medical VQA tasks, our model achieves state-of-the-art performance on the OmniMedVQA dataset. With approximately 8 billion parameters, it requires only two NVIDIA 40 GB A100 GPUs, demonstrating superior efficiency over larger models. It reaches 73.4% accuracy on open-ended questions, surpassing existing models and validating its potential for real-world medical applications. Key contributions include a specialized multimodal VQA model, a resource-efficient architecture, and strong performance on open-ended clinical questions.
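The abstract does not specify how BiomedCLIP features are fed into LLaMA-3. A common way to wire such a pipeline (used by LLaVA-style models, and plausibly similar here) is a small projector that maps image patch features into the language model's token-embedding space, so visual tokens can be prepended to the text sequence. The sketch below is a minimal illustration under those assumptions; the class name `VisionLanguageProjector`, the two-layer MLP design, and all dimensions are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class VisionLanguageProjector(nn.Module):
    """Hypothetical cross-modal alignment module: projects BiomedCLIP
    patch features into the LLaMA-3 token-embedding space.
    Dimensions are illustrative (512 is BiomedCLIP-like, 4096 is
    LLaMA-like), not confirmed by the paper."""

    def __init__(self, vision_dim: int = 512, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP projector, a common LLaVA-style design;
        # the paper's actual projector is not specified.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, vision_dim)
        # returns:     (batch, num_patches, llm_dim) "visual tokens"
        return self.proj(image_feats)

# Dummy features standing in for BiomedCLIP output (no model download).
feats = torch.randn(2, 32, 512)
visual_tokens = VisionLanguageProjector()(feats)
print(visual_tokens.shape)  # torch.Size([2, 32, 4096])
```

In such designs, only the projector (and optionally low-rank adapter weights inside the LLM, e.g. LoRA) is trained, which is one way the "parameter-efficient adaptation" mentioned above is typically realized.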