Efficient Bilinear Attention-based Fusion for Medical Visual Question Answering

📅 2024-10-28
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost and deployment challenges of medical visual question answering (MedVQA) in resource-constrained clinical settings, this paper proposes OMniBAN, a lightweight cross-modal fusion model. Methodologically, it applies orthogonal regularization jointly to multi-head attention and a bilinear attention network (BAN), enabling efficient vision–language feature alignment with minimal redundancy, and it designs a streamlined cross-modal fusion architecture. On major MedVQA benchmarks, OMniBAN reduces model parameters by 61% and accelerates inference by 42% compared to state-of-the-art counterparts while maintaining competitive accuracy, setting a new state of the art in parameter efficiency and inference speed. These gains make deployment in real-world clinical environments substantially more practical.
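The summary mentions orthogonal regularization applied to the attention projections. The paper's exact formulation is not given here; a common form of this idea penalizes the deviation of a weight matrix's Gram matrix from the identity, which pushes its rows toward orthonormality and reduces redundancy between learned projections. The sketch below (function name and numpy formulation are illustrative assumptions, not the paper's code) shows that form:

```python
import numpy as np

def orthogonality_loss(W: np.ndarray) -> float:
    """Penalize deviation of W's rows from orthonormality.

    A common orthogonal-regularization term (the paper's exact
    formulation may differ): ||W W^T - I||_F^2.
    """
    gram = W @ W.T
    eye = np.eye(W.shape[0])
    return float(np.sum((gram - eye) ** 2))

# A perfectly orthonormal matrix incurs zero loss.
print(orthogonality_loss(np.eye(3)))  # 0.0
```

During training, such a term would be added to the task loss with a small weighting coefficient, trading off answer accuracy against redundancy among attention heads.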

📝 Abstract
Medical Visual Question Answering (MedVQA) has attracted growing interest at the intersection of computer vision and natural language processing. By interpreting medical images and providing precise answers to relevant clinical inquiries, MedVQA has the potential to support diagnostic decision-making and reduce workload across various domains, particularly radiology. While recent approaches rely heavily on unified large pre-trained Visual-Language Models, research on more efficient fusion mechanisms remains relatively limited in this domain. In this paper, we introduce a novel fusion model, OMniBAN, that integrates Orthogonality loss, Multi-head attention, and a Bilinear Attention Network to achieve high computational efficiency alongside solid performance. We conduct comprehensive experiments and provide insights into how bilinear attention fusion can approximate the performance of larger fusion models like cross-modal Transformer. Our results demonstrate that OMniBAN outperforms traditional approaches on key MedVQA benchmarks while maintaining a lower computational cost. This balance between efficiency and accuracy suggests that OMniBAN could be a viable option for real-world medical image question answering, where computational resources are often constrained.
Problem

Research questions and friction points this paper is trying to address.

Efficient fusion mechanisms
Medical Visual Question Answering
Computational efficiency and accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Orthogonality loss integration
Multi-head attention mechanism
Bilinear Attention Network
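The abstract identifies bilinear attention as the core fusion mechanism between image and question features. As a rough illustration (not the paper's implementation; real BANs use low-rank factorization and multiple glimpses), a single-glimpse bilinear attention scores every image-region/question-token pair through a bilinear form and pools their elementwise products under the resulting attention map. All names below are hypothetical:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a flat array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def bilinear_attention_fuse(V: np.ndarray, Q: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Single-glimpse bilinear attention fusion (illustrative sketch).

    V: (n_v, d) image-region features
    Q: (n_q, d) question-token features
    W: (d, d)   bilinear interaction weights

    Scores s_ij = v_i^T W q_j; attention is a softmax over all
    region/token pairs; the fused vector is the attention-weighted
    sum of elementwise products v_i * q_j.
    """
    scores = V @ W @ Q.T                                   # (n_v, n_q)
    attn = softmax(scores.ravel()).reshape(scores.shape)   # sums to 1
    fused = np.zeros(V.shape[1])
    for i in range(V.shape[0]):
        for j in range(Q.shape[0]):
            fused += attn[i, j] * (V[i] * Q[j])
    return fused

rng = np.random.default_rng(0)
V = rng.normal(size=(4, 8))   # 4 image regions
Q = rng.normal(size=(5, 8))   # 5 question tokens
W = rng.normal(size=(8, 8))
fused = bilinear_attention_fuse(V, Q, W)   # shape (8,)
```

The appeal of this family of fusion operators, as the abstract argues, is that the bilinear interaction captures fine-grained cross-modal pairings at far lower cost than stacking full cross-modal Transformer layers.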
Zhilin Zhang
Tandon School of Engineering, New York University, New York, USA
Jie Wang
College of Electronic and Information Engineering, Tongji University, Shanghai, China
Ruiqi Zhu
King's College London
Reinforcement Learning, Machine Learning, Surgical Robots
Xiaoliang Gong
College of Electronic and Information Engineering, Tongji University, Shanghai, China