🤖 AI Summary
Medical visual question answering (VQA) faces three obstacles: heterogeneous imaging modalities (e.g., X-ray, CT, MRI) that are difficult to interpret jointly, model bloat, and limited clinical deployability. To address these, we propose MedVQA-LLM, a lightweight multimodal large language model. Our approach integrates the BiomedCLIP image encoder with the LLaMA-3 language model, enhanced by cross-modal alignment fine-tuning and parameter-efficient adaptation, substantially reducing computational overhead. On the OmniMedVQA benchmark, MedVQA-LLM achieves state-of-the-art open-ended VQA accuracy of 73.4%. It runs efficiently on just two 40 GB A100 GPUs, with approximately 8 billion parameters and 3.2× faster inference than billion-parameter baselines. To our knowledge, this is the first work to jointly achieve high accuracy and low-resource efficiency in medical VQA, enabling practical edge deployment in clinical settings.
📝 Abstract
Medical Visual Question Answering (VQA) enhances clinical decision-making by enabling systems to interpret medical images and answer clinical queries. However, developing efficient, high-performance VQA models is challenging due to the complexity of medical imagery and the diversity of imaging modalities. In this paper, we introduce a lightweight multimodal VQA model that integrates BiomedCLIP for image feature extraction with LLaMA-3 for text processing. Designed for medical VQA tasks, our model achieves state-of-the-art performance on the OmniMedVQA dataset. With approximately 8 billion parameters, it requires only two NVIDIA 40 GB A100 GPUs, demonstrating superior efficiency over larger models. It reaches 73.4% accuracy on open-ended questions, surpassing existing models and validating its potential for real-world medical applications. Key contributions include a specialized multimodal VQA model, a resource-efficient architecture, and strong performance on open-ended clinical questions.
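The abstract does not specify how BiomedCLIP features are fed into LLaMA-3. A common way to wire such a pipeline (used by LLaVA-style models, and plausibly similar here) is a small projector that maps image patch features into the language model's token-embedding space, so visual tokens can be prepended to the text sequence. The sketch below is a minimal illustration under those assumptions; the class name `VisionLanguageProjector`, the two-layer MLP design, and all dimensions are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class VisionLanguageProjector(nn.Module):
    """Hypothetical cross-modal alignment module: projects BiomedCLIP
    patch features into the LLaMA-3 token-embedding space.
    Dimensions are illustrative (512 is BiomedCLIP-like, 4096 is
    LLaMA-like), not confirmed by the paper."""

    def __init__(self, vision_dim: int = 512, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP projector, a common LLaVA-style design;
        # the paper's actual projector is not specified.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, vision_dim)
        # returns:     (batch, num_patches, llm_dim) "visual tokens"
        return self.proj(image_feats)

# Dummy features standing in for BiomedCLIP output (no model download).
feats = torch.randn(2, 32, 512)
visual_tokens = VisionLanguageProjector()(feats)
print(visual_tokens.shape)  # torch.Size([2, 32, 4096])
```

In such designs, only the projector (and optionally low-rank adapter weights inside the LLM, e.g. LoRA) is trained, which is one way the "parameter-efficient adaptation" mentioned above is typically realized.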