🤖 AI Summary
CLIP-based class-incremental learning (CIL) suffers from high complexity and catastrophic forgetting due to reliance on additional learnable modules, while underutilizing cross-modal representation fusion. Method: We propose a parameter-free incremental adaptation framework that operates exclusively within CLIP's pre-existing cross-modal bridging layers, eliminating auxiliary parameters. We introduce an orthogonal low-rank fusion mechanism that constrains weight updates without replaying historical data, effectively mitigating forgetting. Furthermore, we construct vision-text hybrid prototypes to enhance discriminability via cross-modal collaboration. Contribution/Results: Evaluated on multiple standard benchmarks, our method achieves higher average accuracy and lower forgetting rates with significantly reduced computational overhead, establishing a new paradigm for efficient, stable, and lightweight multimodal incremental learning.
📝 Abstract
Class-Incremental Learning (CIL) aims to continually learn new categories without forgetting previously acquired knowledge. Vision-language models such as CLIP offer strong transferable representations via multi-modal supervision, making them promising for CIL. However, applying CLIP to CIL poses two major challenges: (1) adapting to downstream tasks often requires additional learnable modules, increasing model complexity and susceptibility to forgetting; and (2) while multi-modal representations offer complementary strengths, existing methods have yet to fully realize their potential in effectively integrating visual and textual modalities. To address these issues, we propose BOFA (Bridge-layer Orthogonal Fusion for Adaptation), a novel framework for CIL. BOFA confines all model adaptation exclusively to CLIP's existing cross-modal bridge-layer, thereby adding no extra parameters or inference cost. To prevent forgetting within this layer, it leverages Orthogonal Low-Rank Fusion, a mechanism that constrains parameter updates to a low-rank "safe subspace" mathematically constructed to be orthogonal to past-task features. This ensures stable knowledge accumulation without data replay. Furthermore, BOFA employs a cross-modal hybrid prototype that synergizes stable textual prototypes with visual counterparts derived from the adapted bridge-layer, enhancing classification performance. Extensive experiments on standard benchmarks show that BOFA achieves superior accuracy and efficiency compared to existing methods.
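The two core ideas in the abstract can be sketched in a few lines of NumPy: projecting a weight update onto the orthogonal complement of the subspace spanned by past-task features (so the update leaves old-task responses unchanged to first order), and scoring a query against a convex mixture of textual and visual prototypes. This is a minimal illustrative sketch, not BOFA's actual implementation; the function names, the rank parameter, and the mixing weight `alpha` are assumptions for illustration.

```python
import numpy as np

def orthogonal_safe_update(delta, past_feats, rank):
    """Project a candidate weight update onto the complement of the
    top-`rank` subspace spanned by past-task features, so that the
    updated layer's outputs on old-task inputs are (approximately)
    unchanged. Shapes: delta is (d_out, d_in), past_feats is (n, d_in)."""
    _, _, vt = np.linalg.svd(past_feats, full_matrices=False)
    basis = vt[:rank].T                      # (d_in, rank) basis of old-task directions
    return delta - delta @ basis @ basis.T   # remove components acting on that subspace

def hybrid_prototype_scores(x, text_protos, visual_protos, alpha=0.5):
    """Cosine similarity of a query feature against per-class prototypes
    formed by mixing text and visual prototypes (one row per class)."""
    protos = alpha * text_protos + (1.0 - alpha) * visual_protos
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    x = x / np.linalg.norm(x)
    return protos @ x                        # one score per class
```

With `alpha=1.0` the classifier reduces to zero-shot text prototypes; with `alpha=0.0` it is a nearest-class-mean classifier on visual features, so `alpha` interpolates between the two modalities.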