MM-Prompt: Cross-Modal Prompt Tuning for Continual Visual Question Answering

๐Ÿ“… 2025-05-26
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address the modality imbalance and representation drift in continual visual question answering (CVQA) caused by the isolation of visual and textual prompts in pretrained multimodal models, this paper proposes MM-Prompt, a cross-modal prompt querying and recovery mechanism. The method employs cross-signal-guided balanced prompt selection to jointly optimize visual and linguistic prompts, and integrates iterative joint reconstruction with alignment-constrained losses to suppress modality participation skew during prompt tuning. Evaluated on multiple CVQA benchmarks, the approach achieves significant performance gains: it improves average accuracy, enhances knowledge retention, increases modality participation balance by 32%, and reduces the forgetting rate by 41%. To the best of the authors' knowledge, this is the first work to enable dynamic, cross-modal prompt co-modeling while ensuring long-term stability in continual multimodal learning.

๐Ÿ“ Abstract
Continual Visual Question Answering (CVQA) based on pre-trained models (PTMs) has achieved promising progress by leveraging prompt tuning to enable continual multimodal learning. However, most existing methods adopt cross-modal prompt isolation, constructing visual and textual prompts separately, which exacerbates modality imbalance and leads to degraded performance over time. To tackle this issue, we propose MM-Prompt, a novel framework incorporating cross-modal prompt query and cross-modal prompt recovery. The former enables balanced prompt selection by incorporating cross-modal signals during query formation, while the latter promotes joint prompt reconstruction through iterative cross-modal interactions, guided by an alignment loss to prevent representational drift. Extensive experiments show that MM-Prompt surpasses prior approaches in accuracy and knowledge retention, while maintaining balanced modality engagement throughout continual learning.
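The cross-modal prompt query described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the convex fusion weight `alpha`, the prompt-pool size, and the top-k selection over cosine similarity are all assumptions made for illustration of how a fused query can drive balanced prompt selection.

```python
import numpy as np

def cross_modal_prompt_query(vis_feat, txt_feat, prompt_keys, alpha=0.5, k=2):
    """Sketch of a cross-modal prompt query.

    Instead of querying the prompt pool with each modality in isolation,
    the query is formed from a fusion of visual and textual features so
    that prompt selection reflects both modalities. The fusion here is a
    hypothetical convex combination weighted by `alpha`.
    """
    query = alpha * vis_feat + (1.0 - alpha) * txt_feat
    query = query / np.linalg.norm(query)
    # Normalize keys row-wise so the dot product is cosine similarity.
    keys = prompt_keys / np.linalg.norm(prompt_keys, axis=1, keepdims=True)
    sims = keys @ query                  # similarity of the query to each key
    topk = np.argsort(sims)[::-1][:k]    # indices of the k best-matching prompts
    return topk, sims

rng = np.random.default_rng(0)
vis = rng.normal(size=16)                # stand-in visual feature
txt = rng.normal(size=16)                # stand-in textual feature
pool = rng.normal(size=(8, 16))          # 8 prompt keys of dimension 16
idx, sims = cross_modal_prompt_query(vis, txt, pool)
```

Because both modalities contribute to the query, neither modality alone dominates which prompts get selected, which is the imbalance the paper attributes to isolated per-modality queries.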
Problem

Research questions and friction points this paper is trying to address.

Addresses modality imbalance in continual visual question answering
Improves cross-modal prompt selection and reconstruction
Enhances accuracy and knowledge retention in CVQA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal prompt query for balanced selection
Cross-modal prompt recovery via iterative interactions
Alignment loss prevents representational drift
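The recovery and alignment ideas in the bullets above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's actual architecture: the symmetric mixing rule, the mixing weight `beta`, the step count, and the MSE form of the alignment penalty are all hypothetical stand-ins for the iterative cross-modal interactions and alignment loss the paper describes.

```python
import numpy as np

def recover_prompts(vis_p, txt_p, steps=3, beta=0.2):
    """Sketch of iterative cross-modal prompt recovery: each modality's
    prompts are repeatedly refined with signal from the other modality.
    The convex mixing rule and step count are illustrative assumptions."""
    v, t = vis_p.copy(), txt_p.copy()
    for _ in range(steps):
        v_next = (1 - beta) * v + beta * t  # visual prompts absorb textual signal
        t_next = (1 - beta) * t + beta * v  # textual prompts absorb visual signal
        v, t = v_next, t_next
    return v, t

def alignment_loss(v, t):
    """Hypothetical alignment penalty: mean squared distance between the
    two modalities' prompts, discouraging representational drift apart."""
    return float(np.mean((v - t) ** 2))

rng = np.random.default_rng(1)
vis_p = rng.normal(size=(4, 16))          # 4 visual prompts, dim 16
txt_p = rng.normal(size=(4, 16))          # 4 textual prompts, dim 16
before = alignment_loss(vis_p, txt_p)
v_rec, t_rec = recover_prompts(vis_p, txt_p)
after = alignment_loss(v_rec, t_rec)      # joint recovery shrinks the gap
```

Each mixing step contracts the gap between the two prompt sets by a factor of (1 - 2·beta), so the alignment penalty strictly decreases over iterations in this toy version, mirroring the stated goal of keeping modalities engaged jointly rather than drifting apart.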
๐Ÿ”Ž Similar Papers
No similar papers found.
X
Xu Li
Khoury College of Computer Sciences, Northeastern University
Fan Lyu
NLPR, CASIA
Computer Vision · Machine Learning · Artificial Intelligence