🤖 AI Summary
To address coarse-grained cross-modal semantic alignment and weak free-form answer generation in medical visual question answering (Med-VQA), this paper proposes a cross-modal interleaved interaction framework. Methodologically, it integrates the Mamba architecture for efficient long-range vision–language sequence modeling, designs a fine-grained vision–text alignment mechanism and a cross-modal interleaved representation module, and introduces a free-form-answer-enhanced multi-task learning objective that moves beyond conventional classification-based formulations. Evaluated on three benchmarks (VQA-RAD, SLAKE, and OVQA), the framework outperforms state-of-the-art methods. Interpretability analyses further support its fine-grained alignment capability and the fidelity of its generated answers. This work establishes a more flexible, robust, and semantically rich multimodal understanding paradigm for Med-VQA, moving beyond rigid answer-space constraints toward open-ended, clinically grounded reasoning.
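To make the fine-grained alignment idea concrete, the following is a minimal sketch, not the authors' code: one plausible reading of fine-grained vision–text alignment in which each question token is compared against image-region features and only the most question-relevant regions are kept. All function names, shapes, and the top-k selection rule here are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def fine_grained_align(region_feats, token_feats, top_k=8):
    """Select the image regions most relevant to the question tokens.

    region_feats: (num_regions, d) visual region features (assumed).
    token_feats:  (num_tokens, d) question token features (assumed).
    """
    # Token-to-region cosine similarity matrix: (num_tokens, num_regions).
    sim = F.normalize(token_feats, dim=-1) @ F.normalize(region_feats, dim=-1).T
    # Score each region by its best-matching question token.
    region_scores, _ = sim.max(dim=0)                        # (num_regions,)
    top_idx = region_scores.topk(min(top_k, region_feats.size(0))).indices
    # Keep only the most question-relevant regions and their scores.
    return region_feats[top_idx], region_scores[top_idx]

# Example usage with random features.
regions = torch.randn(36, 256)   # e.g., 36 pooled image regions
tokens = torch.randn(12, 256)    # e.g., 12 question tokens
selected, scores = fine_grained_align(regions, tokens)
print(selected.shape, scores.shape)  # torch.Size([8, 256]) torch.Size([8])
```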
📝 Abstract
Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent self-attention-based methods struggle to handle cross-modal semantic alignment between vision and language effectively. Moreover, classification-based methods rely on predefined answer sets; treating the task as simple classification limits adaptation to the diversity of free-form answers and discards their detailed semantic information. To tackle these challenges, we introduce a Cross-Mamba Interaction based Multi-Task Learning (CMI-MTL) framework that learns cross-modal feature representations from images and texts. CMI-MTL comprises three key modules: fine-grained visual-text feature alignment (FVTA), cross-modal interleaved feature representation (CIFR), and free-form answer-enhanced multi-task learning (FFAE). FVTA extracts the most relevant regions in image-text pairs through fine-grained visual-text feature alignment. CIFR captures cross-modal sequential interactions via cross-modal interleaved feature representation. FFAE leverages auxiliary knowledge from open-ended questions through free-form answer-enhanced multi-task learning, improving the model's capability for open-ended Med-VQA. Experimental results show that CMI-MTL outperforms existing state-of-the-art methods on three Med-VQA datasets: VQA-RAD, SLAKE, and OVQA. Furthermore, we conduct interpretability experiments to demonstrate its effectiveness. The code is publicly available at https://github.com/BioMedIA-repo/CMI-MTL.
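As a rough illustration of what "cross-modal interleaved feature representation" can mean, the sketch below merges visual and textual tokens into a single alternating sequence and passes it through a long-range sequence mixer. The paper uses Mamba (a selective state-space model) for this role; to keep the example dependency-free, a GRU stand-in is used here, and all class names, shapes, and the interleaving scheme are assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class InterleavedCrossModalBlock(nn.Module):
    """Interleave visual and textual tokens, then mix them with a sequence model."""

    def __init__(self, d_model=256):
        super().__init__()
        # Stand-in sequence mixer; the paper's CIFR module uses a Mamba block instead.
        self.seq_mixer = nn.GRU(d_model, d_model, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, vis_tokens, txt_tokens):
        """vis_tokens: (B, Nv, d), txt_tokens: (B, Nt, d)."""
        B, Nv, d = vis_tokens.shape
        Nt = txt_tokens.shape[1]
        n = min(Nv, Nt)
        # Alternate visual and textual tokens position by position (v0, t0, v1, t1, ...),
        # then append whatever remains of the longer stream.
        interleaved = torch.stack((vis_tokens[:, :n], txt_tokens[:, :n]), dim=2)
        interleaved = interleaved.reshape(B, 2 * n, d)
        tail = vis_tokens[:, n:] if Nv > Nt else txt_tokens[:, n:]
        seq = torch.cat((interleaved, tail), dim=1)        # (B, Nv + Nt, d)
        mixed, _ = self.seq_mixer(seq)
        return self.norm(mixed + seq)                      # residual fusion

# Example usage with random token embeddings.
block = InterleavedCrossModalBlock(d_model=256)
out = block(torch.randn(2, 36, 256), torch.randn(2, 12, 256))
print(out.shape)  # torch.Size([2, 48, 256])
```

The interleaving keeps per-position adjacency between image and question tokens, so a causal or recurrent sequence model can pick up fine-grained cross-modal interactions without a quadratic attention map; swapping the GRU for a Mamba block would recover the efficiency argument made in the paper.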