🤖 AI Summary
To address catastrophic forgetting in continual visual question answering (VQA), where vision-language models (VLMs) rapidly lose prior knowledge when learning new tasks, this paper proposes the first fully data-free continual learning method for VQA. The approach stores no historical data and uses no external models; instead, it leverages the intrinsic language generation capability of a single VLM to pose previous-task questions on new visual inputs, thereby constructing cross-modal pseudo-rehearsal samples. Because the generated questions skew toward the most frequently posed ones, a pseudo-rehearsal balancing module aligns the generated distribution with the original task distribution, using either question meta-statistics or unsupervised K-means clustering. On the VQACL-VQAv2 and CLOVE-function benchmarks, the method substantially outperforms all existing data-free baselines and approaches the performance of strong methods that retain access to historical data.
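The generation step described above can be sketched roughly as follows. This is a minimal, hypothetical illustration: `vlm_generate` is a stand-in stub for a real VLM inference call, and the prompt wording is an assumption, not the paper's actual prompt.

```python
# Hypothetical sketch of pseudo-rehearsal generation: for each new-task image,
# a single VLM is prompted to pose a question in the style of a previous task,
# then answers its own question, yielding an (image, question, answer) triple.

def vlm_generate(image, prompt):
    # Stub standing in for an actual VLM inference call (assumption).
    if prompt.startswith("Ask"):
        return f"What color is the object in {image}?"
    return "red"

def make_pseudo_rehearsal(new_images, prev_task_name):
    samples = []
    for img in new_images:
        q = vlm_generate(
            img,
            f"Ask a question about this image in the style of the "
            f"'{prev_task_name}' task.",
        )
        a = vlm_generate(img, q)  # the same VLM answers its own question
        samples.append({"image": img, "question": q, "answer": a})
    return samples
```

The key design point is that question and answer both come from the one VLM being trained, so no external generator or stored historical data is needed.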
📝 Abstract
Vision-Language Models (VLMs) have shown significant promise in Visual Question Answering (VQA) tasks by leveraging web-scale multimodal datasets. However, these models often struggle with continual learning due to catastrophic forgetting when adapting to new tasks. As an effective remedy, the rehearsal strategy replays data from past tasks when learning a new task. However, such a strategy requires storing past data, which may be infeasible due to hardware constraints or privacy concerns. In this work, we propose the first data-free method that leverages the language generation capability of a VLM, rather than relying on external models, to produce pseudo-rehearsal data for continual VQA. Our method, named GaB, generates pseudo-rehearsal data by posing previous-task questions on new task data. Although effective, the distribution of generated questions skews towards the most frequently posed questions because the training data is limited and task-specific. To mitigate this issue, we introduce a pseudo-rehearsal balancing module that aligns the generated data with the ground-truth data distribution using either question meta-statistics or an unsupervised clustering method. We evaluate our method on two recent benchmarks, i.e., VQACL-VQAv2 and CLOVE-function. GaB outperforms all data-free baselines by a substantial margin in maintaining VQA performance across evolving tasks, while remaining on par with methods that have access to past data.
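The balancing idea in the abstract can be sketched as a cluster-proportional resampling step. This is a minimal sketch under assumptions: cluster ids are presumed to come from K-means over question embeddings (not computed here), and the function name and interface are illustrative, not the paper's implementation.

```python
import random
from collections import Counter

def balance_pseudo_rehearsal(generated, gen_clusters, ref_clusters,
                             n_samples, seed=0):
    """Resample generated questions so their cluster mix matches the
    reference (ground-truth) distribution.

    generated:    list of generated questions
    gen_clusters: cluster id per generated question (e.g. from K-means
                  on question embeddings -- assumed precomputed)
    ref_clusters: cluster ids observed in the original task data
    n_samples:    size of the balanced pseudo-rehearsal set
    """
    rng = random.Random(seed)
    # Target proportion of each cluster, taken from the reference data.
    ref_counts = Counter(ref_clusters)
    total_ref = sum(ref_counts.values())
    # Bucket generated questions by their assigned cluster.
    buckets = {}
    for q, c in zip(generated, gen_clusters):
        buckets.setdefault(c, []).append(q)
    balanced = []
    for c, cnt in ref_counts.items():
        quota = round(n_samples * cnt / total_ref)
        pool = buckets.get(c, [])
        if pool:
            # Sample with replacement to fill each cluster's quota.
            balanced.extend(rng.choices(pool, k=quota))
    return balanced
```

Resampling toward the reference cluster proportions counteracts the skew toward frequently generated questions without requiring any stored answers from past tasks.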