🤖 AI Summary
This work addresses poor task adaptation and catastrophic forgetting in vision-language models under continual learning by proposing an efficient, task-ID-free approach based on LoRA. The method restructures a single LoRA module into a shared pool of rank-1 experts and dynamically composes task-specific updates as sparse combinations of those experts, guided by the semantics of the [CLS] token, which substantially reduces parameter overhead. An Activation-Guided Orthogonal (AGO) loss is additionally introduced to mitigate interference across tasks. Requiring no external knowledge, task identifiers, or additional inference latency, the approach reports state-of-the-art performance across multiple settings while cutting trainable parameters by 96.7% relative to the baseline method (that is, 3.3% of the baseline's count), and its generalization surpasses the zero-shot upper bound.
📝 Abstract
Continual learning (CL) in vision-language models (VLMs) faces significant challenges in improving task adaptation while avoiding catastrophic forgetting. Existing methods usually incur a heavy inference burden or rely on external knowledge, whereas Low-Rank Adaptation (LoRA) has shown potential for reducing these issues by enabling parameter-efficient tuning. However, directly using LoRA to alleviate catastrophic forgetting is non-trivial. We therefore introduce a novel framework that restructures a single LoRA module as a decomposable Rank-1 Expert Pool. Our method learns to dynamically compose a sparse, task-specific update by selecting from this expert pool, guided by the semantics of the [CLS] token. In addition, we propose an Activation-Guided Orthogonal (AGO) loss that orthogonalizes critical parts of the LoRA weights across tasks. Together, sparse composition and orthogonalization require fewer parameter updates, yielding domain-aware learning that minimizes inter-task interference while maintaining downstream task performance. Extensive experiments across multiple settings demonstrate state-of-the-art results in all metrics, surpassing zero-shot upper bounds in generalization. Notably, the method reduces trainable parameters by 96.7% compared to the baseline method and eliminates reliance on external datasets or task-ID discriminators. The merged LoRAs retain fewer weights and incur no additional inference latency, making our method computationally lightweight.
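To make the idea of composing a sparse update from a rank-1 expert pool more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a hypothetical linear router over the [CLS] feature, top-k selection, and softmax gating over the selected experts; the abstract does not specify any of these details, and the names `compose_rank1_update`, `router_weights`, and `top_k` are illustrative only.

```python
import numpy as np

def compose_rank1_update(cls_embedding, router_weights, A, B, top_k=2):
    """Compose a sparse low-rank update from a pool of rank-1 experts.

    Assumed shapes (illustrative, not taken from the paper):
      cls_embedding:  (d_in,)             [CLS] feature used for routing
      router_weights: (num_experts, d_in) hypothetical linear router
      A:              (num_experts, d_in) row i is expert i's vector a_i
      B:              (d_out, num_experts) column i is expert i's vector b_i
    Returns (delta_W, selected, gates) with delta_W of shape (d_out, d_in).
    """
    # One routing score per expert, computed from the [CLS] feature.
    scores = router_weights @ cls_embedding
    # Keep only the top-k experts (assumed selection rule).
    selected = np.argsort(scores)[-top_k:]
    # Softmax over the selected scores only, shifted for numerical stability.
    shifted = scores[selected] - scores[selected].max()
    gates = np.exp(shifted) / np.exp(shifted).sum()
    # delta_W = sum_i gate_i * outer(b_i, a_i) over the selected experts.
    delta_W = (B[:, selected] * gates) @ A[selected]
    return delta_W, selected, gates

# Toy usage with random values (all sizes are arbitrary).
rng = np.random.default_rng(0)
d_in, d_out, num_experts, top_k = 8, 6, 5, 2
cls_embedding = rng.normal(size=d_in)
router_weights = rng.normal(size=(num_experts, d_in))
A = rng.normal(size=(num_experts, d_in))
B = rng.normal(size=(d_out, num_experts))

delta_W, selected, gates = compose_rank1_update(
    cls_embedding, router_weights, A, B, top_k=top_k
)
```

Because only `top_k` rank-1 terms contribute, the composed update has rank at most `top_k`. Under the usual LoRA convention, an update of this form would be added to a frozen weight matrix of the same shape.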