AI Summary
To address weak interpretability and limited reasoning capability in gastrointestinal endoscopic visual question answering (VQA), this paper proposes a LoRA-based multi-task collaborative learning framework that jointly models VQA, natural-language explanation generation, and visual grounding. Methodologically, we adopt Florence-2 as the backbone and integrate Kvasir-VQA-x1, synthetically generated explanations, and text-region alignment data to achieve cross-modal semantic alignment and joint modeling of medical reasoning. Our key contribution is the first end-to-end, multi-task interpretable reasoning architecture for gastrointestinal medical VQA, enabled by parameter-efficient fine-tuning to ensure task synergy and generalization. Experiments demonstrate significant improvements over single-task baselines (+4.2% in answer accuracy and +6.8% in grounding IoU), validating the effectiveness of multi-task learning in enhancing both medical visual reasoning and model interpretability.
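The parameter-efficient setup described above can be pictured with a short sketch. The snippet below shows how Florence-2 might be wrapped with LoRA adapters using the Hugging Face `transformers` and `peft` libraries; the checkpoint name, target modules, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal LoRA fine-tuning sketch for Florence-2 (assumed setup, not the
# authors' exact configuration). Requires: transformers, peft.
from transformers import AutoModelForCausalLM, AutoProcessor
from peft import LoraConfig, get_peft_model

# Florence-2 checkpoints are loaded with trust_remote_code enabled.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True
)

# LoRA adapters on the attention projections; all other weights stay frozen,
# so the three tasks (VQA, explanation, grounding) share one compact adapter.
# Module names and ranks below are assumptions for illustration.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of total parameters
```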
Abstract
We present a multi-task framework for the MediaEval Medico 2025 challenge, leveraging a LoRA-tuned Florence-2 model for simultaneous visual question answering (VQA), explanation generation, and visual grounding. The proposed system integrates three curated datasets: (1) Kvasir-VQA-x1 for question-answer learning, (2) a synthetically enriched explanation dataset offering structured medical reasoning, and (3) text-to-region pairs linking visual features with segmentation masks. This multi-task setup enables the model to jointly learn visual grounding, reasoning, and interpretation, producing responses that are both accurate and interpretable. Extensive evaluation demonstrates that our approach substantially improves over single-task baselines in both answer accuracy and visual localization, highlighting the effectiveness of grounded multi-task learning for medical VQA applications.
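To make the joint training concrete, the sketch below illustrates one way the three tasks could be multiplexed through a single prompt interface, following Florence-2's task-token convention. The `<MedVQA>` and `<MedExplain>` prefixes are hypothetical names introduced here for illustration; `<CAPTION_TO_PHRASE_GROUNDING>` is one of Florence-2's native task tokens.

```python
# Hedged sketch: how VQA, explanation, and grounding examples could share one
# prompt format during multi-task fine-tuning. Task prefixes other than
# <CAPTION_TO_PHRASE_GROUNDING> are hypothetical, not the paper's exact tokens.
def build_prompt(task: str, question: str | None = None, phrase: str | None = None) -> str:
    if task == "vqa":
        return f"<MedVQA> {question}"            # answer the clinical question
    if task == "explanation":
        return f"<MedExplain> {question}"        # generate structured medical reasoning
    if task == "grounding":
        return f"<CAPTION_TO_PHRASE_GROUNDING> {phrase}"  # localize the described finding
    raise ValueError(f"unknown task: {task}")

# Example: one training item from each of the three curated data sources.
examples = [
    build_prompt("vqa", question="Is a polyp visible in this frame?"),
    build_prompt("explanation", question="Why is this lesion classified as a polyp?"),
    build_prompt("grounding", phrase="sessile polyp in the lower-left quadrant"),
]
```

Pairing each prompt with its image and target text (answer, explanation, or region coordinates) lets a single LoRA-tuned model learn all three objectives jointly.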