🤖 AI Summary
CLIP-based class-incremental learning (CIL) suffers from high complexity and catastrophic forgetting due to reliance on additional learnable modules, while underutilizing cross-modal representation fusion. Method: We propose a parameter-free incremental adaptation framework that operates exclusively within CLIP's pre-existing cross-modal bridging layers, eliminating auxiliary parameters. We introduce an orthogonal low-rank fusion mechanism that constrains weight updates without replaying historical data, effectively mitigating forgetting. Furthermore, we construct vision-text hybrid prototypes to enhance discriminability via cross-modal collaboration. Contribution/Results: Evaluated on multiple standard benchmarks, our method achieves higher average accuracy and lower forgetting rates with significantly reduced computational overhead, establishing a new paradigm for efficient, stable, and lightweight multimodal incremental learning.
📝 Abstract
Class-Incremental Learning (CIL) aims to continually learn new categories without forgetting previously acquired knowledge. Vision-language models such as CLIP offer strong transferable representations via multi-modal supervision, making them promising for CIL. However, applying CLIP to CIL poses two major challenges: (1) adapting to downstream tasks often requires additional learnable modules, increasing model complexity and susceptibility to forgetting; and (2) while multi-modal representations offer complementary strengths, existing methods have yet to fully realize their potential in effectively integrating visual and textual modalities. To address these issues, we propose BOFA (Bridge-layer Orthogonal Fusion for Adaptation), a novel framework for CIL. BOFA confines all model adaptation exclusively to CLIP's existing cross-modal bridge-layer, thereby adding no extra parameters or inference cost. To prevent forgetting within this layer, it leverages Orthogonal Low-Rank Fusion, a mechanism that constrains parameter updates to a low-rank "safe subspace" mathematically constructed to be orthogonal to past-task features. This ensures stable knowledge accumulation without data replay. Furthermore, BOFA employs a cross-modal hybrid prototype that synergizes stable textual prototypes with visual counterparts derived from the adapted bridge-layer, enhancing classification performance. Extensive experiments on standard benchmarks show that BOFA achieves superior accuracy and efficiency compared to existing methods.
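The two core ideas in the abstract can be sketched in a few lines of NumPy: projecting a weight update onto the orthogonal complement of the subspace spanned by past-task features (so the update leaves old-task responses unchanged to first order), and scoring a query against a convex mixture of textual and visual prototypes. This is a minimal illustrative sketch, not BOFA's actual implementation; the function names, the rank parameter, and the mixing weight `alpha` are assumptions for illustration.

```python
import numpy as np

def orthogonal_safe_update(delta, past_feats, rank):
    """Project a candidate weight update onto the complement of the
    top-`rank` subspace spanned by past-task features, so that the
    updated layer's outputs on old-task inputs are (approximately)
    unchanged. Shapes: delta is (d_out, d_in), past_feats is (n, d_in)."""
    _, _, vt = np.linalg.svd(past_feats, full_matrices=False)
    basis = vt[:rank].T                      # (d_in, rank) basis of old-task directions
    return delta - delta @ basis @ basis.T   # remove components acting on that subspace

def hybrid_prototype_scores(x, text_protos, visual_protos, alpha=0.5):
    """Cosine similarity of a query feature against per-class prototypes
    formed by mixing text and visual prototypes (one row per class)."""
    protos = alpha * text_protos + (1.0 - alpha) * visual_protos
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    x = x / np.linalg.norm(x)
    return protos @ x                        # one score per class
```

With `alpha=1.0` the classifier reduces to zero-shot text prototypes; with `alpha=0.0` it is a nearest-class-mean classifier on visual features, so `alpha` interpolates between the two modalities.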