🤖 AI Summary
To address modality imbalance and representation drift in continual visual question answering (CVQA) caused by the isolation of visual and textual prompts in pretrained multimodal models, this paper proposes a cross-modal prompt querying and recovery mechanism. Our method employs cross-signal-guided balanced prompt selection to jointly optimize visual and linguistic prompts, and integrates iterative joint reconstruction with alignment-constrained losses to suppress modality participation skew during prompt tuning. Evaluated on multiple CVQA benchmarks, our approach achieves significant performance gains: improving average accuracy, enhancing knowledge retention, increasing modality participation balance by 32%, and reducing the forgetting rate by 41%. To the best of our knowledge, this is the first work to enable dynamic, cross-modal prompt co-modeling while ensuring long-term stability in continual multimodal learning.
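To make the querying idea concrete, below is a minimal PyTorch sketch of what cross-signal-guided prompt selection could look like: rather than each modality querying its own prompt pool in isolation, a single selection query is built from both the visual and the textual features, so neither modality dominates which prompts are retrieved. All names and hyperparameters here (`CrossModalPromptQuery`, `pool_size`, `top_k`, and so on) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalPromptQuery(nn.Module):
    """Illustrative sketch (not the paper's code): select prompts from a
    shared pool using a query built from BOTH visual and textual features,
    instead of querying each modality's pool in isolation."""

    def __init__(self, dim=768, pool_size=20, prompt_len=8, top_k=4):
        super().__init__()
        # Shared learnable prompt pool and matching keys.
        self.prompts = nn.Parameter(torch.randn(pool_size, prompt_len, dim) * 0.02)
        self.keys = nn.Parameter(torch.randn(pool_size, dim) * 0.02)
        # Projections that mix the two modalities into one selection query.
        self.v_proj = nn.Linear(dim, dim)
        self.t_proj = nn.Linear(dim, dim)
        self.top_k = top_k

    def forward(self, vis_feat, txt_feat):
        # vis_feat, txt_feat: (B, dim) pooled features from each encoder.
        # Cross-modal query: both modalities contribute to the selection signal.
        query = F.normalize(self.v_proj(vis_feat) + self.t_proj(txt_feat), dim=-1)
        keys = F.normalize(self.keys, dim=-1)              # (P, dim)
        scores = query @ keys.t()                          # (B, P)
        top = scores.topk(self.top_k, dim=-1).indices      # (B, k)
        selected = self.prompts[top]                       # (B, k, prompt_len, dim)
        return selected.flatten(1, 2), scores              # (B, k*prompt_len, dim)
```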
📝 Abstract
Continual Visual Question Answering (CVQA) based on pre-trained models (PTMs) has achieved promising progress by leveraging prompt tuning to enable continual multi-modal learning. However, most existing methods adopt cross-modal prompt isolation, constructing visual and textual prompts separately, which exacerbates modality imbalance and leads to degraded performance over time. To tackle this issue, we propose MM-Prompt, a novel framework incorporating cross-modal prompt query and cross-modal prompt recovery. The former enables balanced prompt selection by incorporating cross-modal signals during query formation, while the latter promotes joint prompt reconstruction through iterative cross-modal interactions, guided by an alignment loss to prevent representational drift. Extensive experiments show that MM-Prompt surpasses prior approaches in accuracy and knowledge retention while maintaining balanced modality engagement throughout continual learning.
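The recovery side of the framework can be sketched in the same spirit. One plausible reading of "joint prompt reconstruction through iterative cross-modal interactions" is iterated cross-attention between the two prompt streams, with a simple alignment penalty discouraging the streams from drifting apart. The sketch below follows that assumption; the class name, the residual updates, and the specific MSE alignment term are all hypothetical choices, not the authors' definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalPromptRecovery(nn.Module):
    """Illustrative sketch (not the paper's code): iteratively rebuild each
    modality's prompts by attending to the other modality's prompts, then
    penalize residual misalignment between the two streams."""

    def __init__(self, dim=768, heads=8, iters=2):
        super().__init__()
        self.v_from_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t_from_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.iters = iters

    def forward(self, vis_prompts, txt_prompts):
        # vis_prompts, txt_prompts: (B, N, dim) prompt sequences per modality.
        for _ in range(self.iters):
            # Each stream is reconstructed with the other stream as context.
            v_new, _ = self.v_from_t(vis_prompts, txt_prompts, txt_prompts)
            t_new, _ = self.t_from_v(txt_prompts, vis_prompts, vis_prompts)
            vis_prompts = vis_prompts + v_new   # residual update
            txt_prompts = txt_prompts + t_new
        # Alignment term: pull the mean prompt representations together
        # to limit representational drift between modalities.
        align_loss = F.mse_loss(vis_prompts.mean(dim=1), txt_prompts.mean(dim=1))
        return vis_prompts, txt_prompts, align_loss

if __name__ == "__main__":
    # Shape check with random prompt streams.
    B, N, D = 2, 32, 768
    rec = CrossModalPromptRecovery(dim=D)
    v, t = torch.randn(B, N, D), torch.randn(B, N, D)
    v_out, t_out, loss = rec(v, t)
    print(v_out.shape, t_out.shape, loss.item())
```

In this reading, the alignment loss would be added to the task loss during prompt tuning, so the recovered prompt streams stay coupled across tasks rather than drifting as new tasks arrive.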