Multi-Task Learning for Visually Grounded Reasoning in Gastrointestinal VQA

๐Ÿ“… 2025-11-06
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address weak interpretability and insufficient reasoning capability in gastrointestinal endoscopic visual question answering (VQA), this paper proposes a LoRA-based multi-task collaborative learning framework that jointly models VQA, natural language explanation generation, and visual grounding. Methodologically, we adopt Florence-2 as the backbone and integrate Kvasir-VQA-x1, synthetically generated explanations, and text-region alignment data to achieve cross-modal semantic alignment and joint modeling of medical logic. Our key contribution is the first end-to-end, multi-task interpretable reasoning architecture for gastrointestinal medical VQA, enabled by parameter-efficient fine-tuning to ensure task synergy and generalization. Experiments demonstrate significant improvements over single-task baselines: +4.2% in answer accuracy and +6.8% in grounding IoUโ€”validating the effectiveness of multi-task learning in enhancing both medical visual reasoning and model interpretability.

Technology Category

Application Category

๐Ÿ“ Abstract
We present a multi-task framework for the MediaEval Medico 2025 challenge, leveraging a LoRA-tuned Florence-2 model for simultaneous visual question answering (VQA), explanation generation, and visual grounding. The proposed system integrates three curated datasets: (1) Kvasir-VQA-x1 for question-answer learning, (2) a synthetically enriched explanation dataset offering structured medical reasoning, and (3) text-to-region pairs linking visual features with segmentation masks. This multi-task setup enables the model to jointly learn visual grounding, reasoning, and interpretation, producing responses that are both accurate and interpretable. Extensive evaluation demonstrates that our approach substantially improves over single-task baselines in both answer accuracy and visual localization, highlighting the effectiveness of grounded multi-task learning for medical VQA applications.
Problem

Research questions and friction points this paper is trying to address.

Developing multi-task learning for gastrointestinal visual question answering
Generating medical explanations with structured reasoning capabilities
Linking visual features to anatomical regions through grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-task framework with LoRA-tuned Florence-2 model
Integrates VQA, explanation generation, and visual grounding
Uses curated datasets for joint learning of reasoning
๐Ÿ”Ž Similar Papers
No similar papers found.
I
Itbaan Safwan
School of Mathematics and Computer Science, Institute of Business Administration (IBA), Karachi, Pakistan
M
Muhammad Annas Shaikh
School of Mathematics and Computer Science, Institute of Business Administration (IBA), Karachi, Pakistan
M
Muhammad Haaris
School of Mathematics and Computer Science, Institute of Business Administration (IBA), Karachi, Pakistan
R
Ramail Khan
School of Mathematics and Computer Science, Institute of Business Administration (IBA), Karachi, Pakistan
Muhammad Atif Tahir
Muhammad Atif Tahir
Institute of Business Administration, IBA, Karachi
Machine LearningArtificial Intelligence