CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering

📅 2025-11-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address coarse-grained cross-modal semantic alignment and weak free-form answer generation in medical visual question answering (Med-VQA), this paper proposes a cross-modal interleaved interaction framework. Methodologically, it integrates the Mamba architecture for efficient long-range vision–language sequence modeling, designs a fine-grained vision–text alignment mechanism and a cross-modal interleaved representation module, and introduces an open-answer-augmented multi-task learning objective that moves beyond conventional classification-based paradigms. Evaluated on three benchmarks—VQA-RAD, SLAKE, and OVQA—the framework achieves significant improvements over state-of-the-art methods. Interpretability analysis further confirms its enhanced fine-grained alignment capability and generative fidelity. This work establishes a more flexible, robust, and semantically rich multimodal understanding paradigm for Med-VQA, advancing beyond rigid answer-space constraints toward open-ended, clinically grounded reasoning.

📝 Abstract
Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent self-attention based methods struggle to effectively handle cross-modal semantic alignment between vision and language. Moreover, classification-based methods rely on predefined answer sets; treating the task as simple classification prevents adaptation to the diversity of free-form answers and overlooks their detailed semantics. To tackle these challenges, we introduce a Cross-Mamba Interaction based Multi-Task Learning (CMI-MTL) framework that learns cross-modal feature representations from images and texts. CMI-MTL comprises three key modules: fine-grained visual-text feature alignment (FVTA), cross-modal interleaved feature representation (CIFR), and free-form answer-enhanced multi-task learning (FFAE). FVTA extracts the most relevant regions in image-text pairs through fine-grained visual-text feature alignment. CIFR captures cross-modal sequential interactions via cross-modal interleaved feature representation. FFAE leverages auxiliary knowledge from open-ended questions through free-form answer-enhanced multi-task learning, improving the model's capability for open-ended Med-VQA. Experimental results show that CMI-MTL outperforms existing state-of-the-art methods on three Med-VQA datasets: VQA-RAD, SLAKE, and OVQA. Furthermore, we conduct interpretability experiments to demonstrate its effectiveness. The code is publicly available at https://github.com/BioMedIA-repo/CMI-MTL.
Problem

Research questions and friction points this paper is trying to address.

Addressing cross-modal alignment challenges in medical VQA
Overcoming limitations of classification-based answer generation
Enhancing model adaptability to free-form medical answers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Mamba Interaction for cross-modal feature learning
Fine-grained visual-text alignment for relevant region extraction
Free-form answer-enhanced multi-task learning for open-ended VQA
Qiangguo Jin
Northwestern Polytechnical University
Artificial Intelligence, Deep Learning, Computer Vision, Medical Image Analysis, Bioinformatics
Xianyao Zheng
School of Software, Northwestern Polytechnical University, Shaanxi, China
Hui Cui
Department of Computer Science and Information Technology, La Trobe University, Melbourne, Australia
Changming Sun
CSIRO Data61
Computer Vision, Image Processing, Pattern Recognition, Deep Learning
Yuqi Fang
School of Intelligence Science and Technology, Nanjing University, Suzhou, China
Cong Cong
Australian Institute of Health Innovation (AIHI), Macquarie University, Australia
Ran Su
Tianjin University
Medical Imaging, Bioinformatics
Leyi Wei
Centre for Artificial Intelligence driven Drug Discovery, Faculty of Applied Science, Macao Polytechnic University, Macao Special Administrative Region of China
Ping Xuan
Hainan University
Complex Network Analysis, Medical Image Segmentation, Deep Learning, Artificial Intelligence for
Junbo Wang
School of Software, Northwestern Polytechnical University, Shaanxi, China