Efficient Bilinear Attention-based Fusion for Medical Visual Question Answering

📅 2024-10-28
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost and deployment challenges of medical visual question answering (MedVQA) in resource-constrained clinical settings, this paper proposes OMniBAN, a lightweight cross-modal fusion model. Methodologically, it applies orthogonal regularization jointly to multi-head attention and a bilinear attention network (BAN), enabling efficient vision–language feature alignment with minimal redundancy, and it designs a streamlined cross-modal fusion architecture. On major MedVQA benchmarks, OMniBAN reduces model parameters by 61% and accelerates inference by 42% compared to state-of-the-art counterparts while maintaining competitive accuracy, setting a new state of the art in parameter efficiency and inference speed. These gains make deployment in real-world clinical environments substantially more practical.
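The summary mentions orthogonal regularization applied to the attention projections. The paper's exact formulation is not given here; a common form of this idea penalizes the deviation of a weight matrix's Gram matrix from the identity, which pushes its rows toward orthonormality and reduces redundancy between learned projections. The sketch below (function name and numpy formulation are illustrative assumptions, not the paper's code) shows that form:

```python
import numpy as np

def orthogonality_loss(W: np.ndarray) -> float:
    """Penalize deviation of W's rows from orthonormality.

    A common orthogonal-regularization term (the paper's exact
    formulation may differ): ||W W^T - I||_F^2.
    """
    gram = W @ W.T
    eye = np.eye(W.shape[0])
    return float(np.sum((gram - eye) ** 2))

# A perfectly orthonormal matrix incurs zero loss.
print(orthogonality_loss(np.eye(3)))  # 0.0
```

During training, such a term would be added to the task loss with a small weighting coefficient, trading off answer accuracy against redundancy among attention heads.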

📝 Abstract
Medical Visual Question Answering (MedVQA) has attracted growing interest at the intersection of computer vision and natural language processing. By interpreting medical images and providing precise answers to relevant clinical inquiries, MedVQA has the potential to support diagnostic decision-making and reduce workload across various domains, particularly radiology. While recent approaches rely heavily on unified large pre-trained Visual-Language Models, research on more efficient fusion mechanisms remains relatively limited in this domain. In this paper, we introduce a novel fusion model, OMniBAN, that integrates Orthogonality loss, Multi-head attention, and a Bilinear Attention Network to achieve high computational efficiency alongside solid performance. We conduct comprehensive experiments and provide insights into how bilinear attention fusion can approximate the performance of larger fusion models like cross-modal Transformer. Our results demonstrate that OMniBAN outperforms traditional approaches on key MedVQA benchmarks while maintaining a lower computational cost. This balance between efficiency and accuracy suggests that OMniBAN could be a viable option for real-world medical image question answering, where computational resources are often constrained.
Problem

Research questions and friction points this paper is trying to address.

Efficient fusion mechanisms
Medical Visual Question Answering
Computational efficiency and accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Orthogonality loss integration
Multi-head attention mechanism
Bilinear Attention Network
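The abstract identifies bilinear attention as the core fusion mechanism between image and question features. As a rough illustration (not the paper's implementation; real BANs use low-rank factorization and multiple glimpses), a single-glimpse bilinear attention scores every image-region/question-token pair through a bilinear form and pools their elementwise products under the resulting attention map. All names below are hypothetical:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a flat array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def bilinear_attention_fuse(V: np.ndarray, Q: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Single-glimpse bilinear attention fusion (illustrative sketch).

    V: (n_v, d) image-region features
    Q: (n_q, d) question-token features
    W: (d, d)   bilinear interaction weights

    Scores s_ij = v_i^T W q_j; attention is a softmax over all
    region/token pairs; the fused vector is the attention-weighted
    sum of elementwise products v_i * q_j.
    """
    scores = V @ W @ Q.T                                   # (n_v, n_q)
    attn = softmax(scores.ravel()).reshape(scores.shape)   # sums to 1
    fused = np.zeros(V.shape[1])
    for i in range(V.shape[0]):
        for j in range(Q.shape[0]):
            fused += attn[i, j] * (V[i] * Q[j])
    return fused

rng = np.random.default_rng(0)
V = rng.normal(size=(4, 8))   # 4 image regions
Q = rng.normal(size=(5, 8))   # 5 question tokens
W = rng.normal(size=(8, 8))
fused = bilinear_attention_fuse(V, Q, W)   # shape (8,)
```

The appeal of this family of fusion operators, as the abstract argues, is that the bilinear interaction captures fine-grained cross-modal pairings at far lower cost than stacking full cross-modal Transformer layers.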
Zhilin Zhang
Tandon School of Engineering, New York University, New York, USA
Jie Wang
College of Electronic and Information Engineering, Tongji University, Shanghai, China
Ruiqi Zhu
King's College London
Reinforcement Learning, Machine Learning, Surgical Robots
Xiaoliang Gong
College of Electronic and Information Engineering, Tongji University, Shanghai, China