🤖 AI Summary
Existing medical visual question answering (VQA) approaches are constrained by predefined answer sets, which limits their ability to integrate domain knowledge and to precisely link lesion characteristics with diagnostic criteria. To overcome these limitations, this work proposes the KG-CMI framework, which integrates knowledge graphs with the Mamba architecture. By combining fine-grained cross-modal alignment, knowledge graph embeddings, Mamba-based cross-modal interaction, and multi-task learning for free-form answers, KG-CMI moves beyond the conventional classification paradigm. The proposed method achieves state-of-the-art performance across multiple benchmarks, including VQA-RAD, SLAKE, and OVQA. Furthermore, interpretability experiments substantiate its effectiveness and its capacity for meaningful knowledge fusion, demonstrating its potential to support more accurate and explainable medical VQA systems.
📝 Abstract
Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. However, recent methods fail to fully leverage domain-specific medical knowledge, making it difficult to accurately associate lesion features in medical images with key diagnostic criteria. Additionally, classification-based approaches typically rely on predefined answer sets. Treating Med-VQA as a simple classification problem limits a model's ability to adapt to the diversity of free-form answers and may overlook detailed semantic information in those answers. To address these challenges, we propose a knowledge graph enhanced cross-Mamba interaction (KG-CMI) framework, which consists of a fine-grained cross-modal feature alignment (FCFA) module, a knowledge graph embedding (KGE) module, a cross-modal interaction representation (CMIR) module, and a free-form answer enhanced multi-task learning (FAMT) module. KG-CMI learns cross-modal feature representations for images and texts by integrating professional medical knowledge through a knowledge graph, establishing associations between lesion features and disease knowledge. Moreover, FAMT leverages auxiliary knowledge from open-ended questions, improving the model's capability for open-ended Med-VQA. Experimental results demonstrate that KG-CMI outperforms existing state-of-the-art methods on three Med-VQA datasets, i.e., VQA-RAD, SLAKE, and OVQA. Additionally, we conduct interpretability experiments to further validate the framework's effectiveness.