AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning

📅 2026-04-16

🤖 AI Summary

Existing continual learning methods are poorly suited to the asymmetric architecture of vision-language models: they often let the visual projection layer degrade, which causes catastrophic forgetting and impairs compositional reasoning. This work is the first to identify and address this modality-induced local degradation problem. It introduces Asymmetric Information Masking (AIM), a strategy that uses modality sensitivity analysis and directional regularization to protect visual and language components differently. On the VQA v2 and GQA benchmarks, the method outperforms current state-of-the-art approaches, achieving the best results in both average accuracy and forgetting metrics, and it improves generalization to novel skill-concept compositions.
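To make the idea of protecting the two modalities differently more concrete, here is a minimal Python sketch. It is not the authors' algorithm: the summary does not say how sensitivity is measured or how masks are applied, so the sensitivity scores, the per-group protect ratios, and the helper names `topk_mask` and `masked_sgd_step` are all illustrative assumptions. The sketch only shows one plausible reading, in which each parameter group gets its own mask so that a small visual projector is not crowded out by a much larger language decoder.

```python
from typing import List


def topk_mask(scores: List[float], protect_ratio: float) -> List[int]:
    """Return a 0/1 mask that marks the most sensitive entries as protected.

    `scores` are hypothetical per-parameter sensitivity values (for example,
    squared gradients); `protect_ratio` is the fraction of this group to freeze.
    """
    k = int(protect_ratio * len(scores))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    protected = set(order[:k])
    return [1 if i in protected else 0 for i in range(len(scores))]


def masked_sgd_step(params: List[float], grads: List[float],
                    mask: List[int], lr: float) -> List[float]:
    """Apply a plain SGD update, skipping every entry whose mask is 1."""
    return [p if m else p - lr * g for p, g, m in zip(params, grads, mask)]


# Hypothetical sensitivity scores for a small visual projector
# and a larger language decoder.
visual_scores = [0.9, 0.1, 0.5, 0.3]
language_scores = [0.2, 0.8, 0.4, 0.6, 0.1, 0.3, 0.7, 0.05]

# Assumed asymmetric ratios: protect a larger share of the smaller group.
visual_mask = topk_mask(visual_scores, protect_ratio=0.5)
language_mask = topk_mask(language_scores, protect_ratio=0.25)

print(visual_mask)    # [1, 0, 1, 0]
print(language_mask)  # [0, 1, 0, 0, 0, 0, 1, 0]

# Only the unprotected visual parameters move during the update.
new_visual = masked_sgd_step([1.0, 1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5],
                             visual_mask, lr=0.1)
```

Computing the mask per group rather than from one global threshold is the design choice being illustrated: with a single global ranking, the far more numerous decoder parameters could claim most of the protected slots.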

📝 Abstract

In continual visual question answering (VQA), existing Continual Learning (CL) methods are mostly built for symmetric, unimodal architectures. However, modern Vision-Language Models (VLMs) violate this assumption, as their trainable components are inherently asymmetric. This structural mismatch renders VLMs highly prone to catastrophic forgetting when learning from continuous data streams. Specifically, the asymmetry causes standard global regularization to favor the massive language decoder during optimization, leaving the smaller but critical visual projection layers highly vulnerable to interference. Consequently, this localized degradation leads to a severe loss of compositional reasoning capabilities. To address this, we propose Asymmetric Information Masking (AIM), which balances stability and plasticity by applying targeted masks based on modality-specific sensitivity. Experiments on VQA v2 and GQA under continual VQA settings show that AIM achieves state-of-the-art performance in both Average Performance (AP) and Average Forgetting (AF), while better preserving generalization to novel skill-concept compositions.
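The abstract reports results as Average Performance (AP) and Average Forgetting (AF) without defining them. The sketch below uses the definitions common in the continual learning literature, which may differ in detail from the paper's: AP is the mean accuracy over all tasks after the final training stage, and AF is the mean drop from each earlier task's best past accuracy to its final accuracy. The accuracy matrix `acc` and the function names are illustrative assumptions, and the numbers are made up for the example.

```python
from typing import List


def average_performance(acc: List[List[float]]) -> float:
    """Mean accuracy over all tasks after training on the last task.

    acc[i][j] is the accuracy on task j measured after training on task i.
    """
    final = acc[-1]
    return sum(final) / len(final)


def average_forgetting(acc: List[List[float]]) -> float:
    """Mean drop from each earlier task's best past accuracy to its final one."""
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0  # nothing can be forgotten with a single task
    drops = []
    for j in range(num_tasks - 1):
        # Best accuracy on task j at any stage before the final one.
        best = max(acc[i][j] for i in range(j, num_tasks - 1))
        drops.append(best - acc[-1][j])
    return sum(drops) / len(drops)


# Made-up accuracies for a three-task sequence (entries above the
# diagonal are unused because those tasks have not been trained yet).
acc = [
    [0.70, 0.00, 0.00],
    [0.62, 0.68, 0.00],
    [0.60, 0.61, 0.66],
]

ap = average_performance(acc)
af = average_forgetting(acc)
print(f"AP = {ap:.3f}, AF = {af:.3f}")
```

Higher AP and lower AF are better; a method can score well on one and poorly on the other, which is why both are reported.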

Problem

Research questions and friction points this paper is trying to address.

continual learning

visual question answering

asymmetric architecture

catastrophic forgetting

vision-language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric Information Masking

Continual Learning

Vision-Language Models