🤖 AI Summary
To address hallucination in visual grounding for Medical Visual Question Answering (Med-VQA), which stems from complex biomedical features, scarce annotated data, and poor model generalization, this paper proposes: (1) Med-CLIP-guided rotary position encoding to enhance spatial modeling in medical imaging; (2) the first cross-modal clinical knowledge distillation framework, explicitly injecting expert priors into student models; and (3) a multi-task joint optimization strategy integrating multi-stage feedback training with cross-modal alignment. Evaluated on Med-GRIT-270k, the approach achieves state-of-the-art performance, mitigating hallucination and improving grounding accuracy. The proposed rotary position encoding is model-agnostic and readily adaptable to diverse architectures, which the authors present as a new direction for designing medical multimodal foundation models.
📝 Abstract
Medical Visual Question Answering (Med-VQA) is a crucial subtask within the broader Visual Question Answering (VQA) domain. It requires a system to analyze a given image and its corresponding question, offering reasonable analysis and suggestions that assist medical professionals in making pathological diagnoses or, ideally, providing correct diagnoses independently. More advanced Med-VQA tasks involve Referring and Grounding, which require the system not only to comprehend medical images accurately but also to pinpoint specific biological locations within them. Although many large pre-trained models have demonstrated substantial VQA capabilities, challenges persist in the medical imaging domain. The intricacy of biological features in medical images and the scarcity of high-quality medical image datasets, combined with the fact that current models are not tailored to the medical field in architecture or training paradigm, prevent models from generalizing to their full potential. This leads to issues such as hallucination in Visual Grounding. In this paper, we introduce the ClinKD model, which modifies the model's position encoding and uses a diversified training process. First, we enhance the model's ability to perceive image and modality variations by using Med-CLIP Guided Rotary Position Embedding. Second, we use distillation to give the model prior knowledge before it is trained on the complete training data. Third, a feedback-based training process during the formal training phase further improves data utilization. Notably, under unchanged evaluation protocols, we achieve new state-of-the-art performance on the Med-GRIT-270k dataset, and the Med-CLIP Guided Rotary Position Embedding approach shows potential for generalizing to position encoding in general-purpose models.
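For readers unfamiliar with rotary position embedding, a minimal sketch of the standard, unguided form may help. This is not the ClinKD implementation: the abstract does not specify how Med-CLIP guides the embedding, so that part is omitted. The function names `rope_rotate` and `dot`, the adjacent-pair layout, and the frequency base of 10000 are assumptions chosen for illustration.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply standard rotary position embedding to a single vector.

    vec: list of floats with even length d (a query or key vector).
    position: integer index of the token or image patch.
    base: frequency base; 10000 is a common default, assumed here.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        # Each adjacent pair (x, y) is rotated in its own 2-D plane.
        # The angle grows with the position and shrinks with the pair index.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


# Illustrative query/key vectors (made-up values).
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# Same relative offset (2) at two different absolute positions.
score_near = dot(rope_rotate(q, 3), rope_rotate(k, 1))
score_far = dot(rope_rotate(q, 7), rope_rotate(k, 5))
```

Because each pair is rotated by an angle proportional to its position, the score between a rotated query and a rotated key depends only on their relative offset, so `score_near` and `score_far` agree up to floating-point error. Any guided variant would presumably change how the angles are chosen, but the abstract gives no details on that.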