A Lightweight Large Vision-language Model for Multimodal Medical Images

📅 2025-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical visual question answering (VQA) faces three challenges: interpreting heterogeneous medical imaging modalities (e.g., X-ray, CT, MRI), model bloat, and limited clinical deployability. To address these, we propose MedVQA-LLM, a lightweight multimodal large language model. Our approach integrates the BiomedCLIP image encoder with the LLaMA-3 language model, enhanced by cross-modal alignment fine-tuning and parameter-efficient adaptation techniques that substantially reduce computational overhead. On the OmniMedVQA benchmark, MedVQA-LLM achieves state-of-the-art accuracy of 73.4% on open-ended VQA. With approximately 8 billion parameters, it runs on just two 40 GB A100 GPUs and delivers 3.2× faster inference than larger baselines. To our knowledge, this is the first work to jointly achieve high accuracy and low-resource efficiency in medical VQA, enabling practical edge deployment in clinical settings.
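The cross-modal alignment described above — mapping BiomedCLIP image features into the language model's token-embedding space so LLaMA-3 can attend over them — can be sketched roughly as below. The dimensions (512 for the image encoder, 4096 for an 8B-scale LLM), the single linear projection, and the `fuse` helper are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

IMG_DIM = 512    # assumed BiomedCLIP feature size (illustrative)
LLM_DIM = 4096   # assumed LLaMA-3 8B hidden size (illustrative)

# Learnable projection mapping visual features into the LLM token space
# (a single linear layer here; the paper's alignment module may differ).
W_proj = rng.standard_normal((IMG_DIM, LLM_DIM)) * 0.02

def fuse(image_feats, text_embeds):
    """Prepend projected visual tokens to the text token embeddings."""
    visual_tokens = image_feats @ W_proj            # (n_img, LLM_DIM)
    return np.concatenate([visual_tokens, text_embeds], axis=0)

image_feats = rng.standard_normal((16, IMG_DIM))    # 16 visual patch features
text_embeds = rng.standard_normal((32, LLM_DIM))    # 32 text-token embeddings
sequence = fuse(image_feats, text_embeds)
print(sequence.shape)  # (48, 4096)
```

During alignment fine-tuning, only the projection (and any adapter weights) would typically be updated while both pretrained backbones stay frozen, which is what keeps the approach cheap to train.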

📝 Abstract
Medical Visual Question Answering (VQA) enhances clinical decision-making by enabling systems to interpret medical images and answer clinical queries. However, developing efficient, high-performance VQA models is challenging due to the complexity of medical imagery and the diversity of imaging modalities. In this paper, we introduce a lightweight, multimodal VQA model integrating BiomedCLIP for image feature extraction and LLaMA-3 for text processing. Designed for medical VQA tasks, our model achieves state-of-the-art performance on the OmniMedVQA dataset. With approximately 8 billion parameters, it requires only two NVIDIA 40 GB A100 GPUs, demonstrating superior efficiency over larger models. Our results show 73.4% accuracy on open-ended questions, surpassing existing models and validating its potential for real-world medical applications. Key contributions include a specialized multimodal VQA model, a resource-efficient architecture, and strong performance in answering open-ended clinical questions.
Problem

Research questions and friction points this paper is trying to address.

Develops lightweight VQA model for medical images
Integrates BiomedCLIP and LLaMA-3 for multimodal processing
Achieves high accuracy in clinical question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates BiomedCLIP and LLaMA-3 for multimodal processing
Lightweight model with 8 billion parameters
Achieves 73.4% accuracy on open-ended questions
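The parameter-efficient adaptation that keeps an 8B-parameter model trainable on modest hardware can be illustrated with a LoRA-style low-rank update. The specific technique, rank, and dimensions below are assumptions for illustration; the paper states only that parameter-efficient adaptation is used.

```python
import numpy as np

rng = np.random.default_rng(1)

D, RANK = 4096, 8  # hidden size and low rank (illustrative values)

W_frozen = rng.standard_normal((D, D)) * 0.02  # pretrained weight, kept frozen
A = rng.standard_normal((D, RANK)) * 0.02      # trainable down-projection
B = np.zeros((RANK, D))                        # trainable up-projection (zero init)

def adapted_forward(x):
    """y = x W + x A B: only A and B are updated during fine-tuning."""
    return x @ W_frozen + (x @ A) @ B

# Trainable parameters per layer shrink from D*D to 2*D*RANK.
full, lora = D * D, 2 * D * RANK
print(f"trainable fraction: {lora / full:.4%}")  # prints: trainable fraction: 0.3906%
```

With `B` initialized to zero, the adapted layer starts out identical to the frozen pretrained layer, so fine-tuning begins from the backbone's behavior rather than perturbing it.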
Belal Alsinglawi
Swinburne University of Technology, Melbourne, Australia
Chris McCarthy
Associate Professor, Swinburne University of Technology
Computer Vision, Assistive Technology, Robotics, Smart Cities, Internet of Things
Sara Webb
Swinburne University of Technology, Melbourne, Australia
Christopher Fluke
Swinburne University of Technology, Melbourne, Australia
Navid Toosy Saidy
PropelHealthAI, Brisbane, Australia