Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing CLIP-based class-incremental learning methods: they rely solely on global image embeddings and overlook the rich local semantic information within the encoder, which leads to suboptimal recognition performance and severe catastrophic forgetting. To overcome these issues, the authors propose Semantic-guided Patch-level Alignment (SPA), which uses class-wise semantic descriptions generated by GPT-5 to guide the selection of discriminative local visual features and employs optimal transport to achieve fine-grained cross-modal alignment between image patches and semantic tokens. SPA also incorporates task-specific projectors and a Gaussian pseudo-feature sampling mechanism to mitigate forgetting. The authors present SPA as a systematic way to exploit patch-level semantics in CLIP-based class-incremental learning and report state-of-the-art performance in extensive experiments.
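The two alignment steps named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `select_patches` and `sinkhorn_alignment`, the max-similarity relevance score, the uniform marginals, the cosine cost, and the values of `eps` and `n_iters` are all assumptions made for illustration, and entropic (Sinkhorn) optimal transport is used here as one common way to compute a transport plan.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def select_patches(patches, tokens, k):
    """Keep the k patches most similar to any semantic token (assumed criterion)."""
    sim = l2_normalize(patches) @ l2_normalize(tokens).T   # (M, N)
    relevance = sim.max(axis=1)                            # best match per patch
    keep = np.argsort(relevance)[-k:]                      # indices of top-k patches
    return patches[keep]

def sinkhorn_alignment(patches, tokens, eps=0.1, n_iters=200):
    """Entropic OT plan between patch features (M, d) and token features (N, d)."""
    sim = l2_normalize(patches) @ l2_normalize(tokens).T   # cosine similarity
    cost = 1.0 - sim                                       # lower cost = better match
    kernel = np.exp(-cost / eps)                           # Gibbs kernel
    a = np.full(sim.shape[0], 1.0 / sim.shape[0])          # uniform mass on patches (assumed)
    b = np.full(sim.shape[1], 1.0 / sim.shape[1])          # uniform mass on tokens (assumed)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):                               # Sinkhorn scaling iterations
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    plan = u[:, None] * kernel * v[None, :]                # transport plan, shape (M, N)
    score = float((plan * sim).sum())                      # plan-weighted similarity
    return plan, score

# Toy usage with random stand-ins for CLIP patch and text-token features.
rng = np.random.default_rng(0)
patches = rng.normal(size=(49, 16))
tokens = rng.normal(size=(6, 16))
plan, score = sinkhorn_alignment(select_patches(patches, tokens, k=10), tokens)
```

In this sketch the scalar `score` would act as a class-level matching score for one image against one class description; how SPA actually turns the transport plan into a prediction is not specified in this summary.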

📝 Abstract

Class-Incremental Learning (CIL) enables models to continuously integrate new knowledge while mitigating catastrophic forgetting. Driven by the remarkable generalization of CLIP, leveraging pre-trained vision-language models has become a dominant paradigm in CIL. However, current methods primarily focus on aligning global image embeddings (i.e., the [CLS] token) with their corresponding text prompts (i.e., the [EOS] token). Despite the good performance of these methods, we find that they discard the rich patch-level semantic information inherent in CLIP's encoders. For instance, when recognizing a rabbit, local patches may encode its distinctive cues, such as long ears and a fluffy tail, which can provide complementary evidence for recognition. Based on this observation, we propose SPA (Semantic-guided Patch-level Alignment) for CLIP-based CIL, which aims to awaken long-neglected local representations within CLIP. Specifically, for each class, we first construct representative and diverse visual samples and feed them to GPT-5 as visual guidance to generate class-wise semantic descriptions. These descriptions are used to guide the selection of discriminative patch-level visual features. Building upon these selected patches, we further employ optimal transport to align the selected patch tokens with semantic tokens from the class-wise descriptions, yielding a structured cross-modal alignment that improves recognition. Furthermore, we introduce task-specific projectors for effective adaptation to downstream incremental tasks, and sample pseudo-features from stored class-wise Gaussian statistics to calibrate old-class representations, thereby mitigating catastrophic forgetting. Extensive experiments demonstrate that SPA achieves state-of-the-art performance.
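The pseudo-feature step described in the abstract can be sketched as follows. This is an illustrative assumption rather than the paper's procedure: the helper names `class_statistics` and `sample_pseudo_features`, the use of a full covariance matrix, and the small ridge term added for numerical stability are all hypothetical choices, and the abstract does not state how the sampled features are used during calibration.

```python
import numpy as np

def class_statistics(features, ridge=1e-4):
    """Return the mean and (regularized) covariance of one class's features (n, d)."""
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    cov = cov + ridge * np.eye(features.shape[1])   # keep the covariance well conditioned
    return mean, cov

def sample_pseudo_features(stats, n_per_class, rng):
    """Draw n_per_class Gaussian pseudo-features for every stored class."""
    feats, labels = [], []
    for label, (mean, cov) in stats.items():
        feats.append(rng.multivariate_normal(mean, cov, size=n_per_class))
        labels.append(np.full(n_per_class, label))
    return np.concatenate(feats), np.concatenate(labels)

# Toy usage: store statistics for two "old" classes, then replay samples later.
rng = np.random.default_rng(0)
old_features = {
    0: rng.normal(0.0, 1.0, size=(200, 4)),
    1: rng.normal(3.0, 1.0, size=(200, 4)),
}
stats = {c: class_statistics(f) for c, f in old_features.items()}
pseudo_x, pseudo_y = sample_pseudo_features(stats, n_per_class=64, rng=rng)
```

Only the per-class mean and covariance need to be kept after a task ends, so old-class features can be replayed without storing the original images; in a setup like this, `pseudo_x` and `pseudo_y` would be mixed with new-task features when updating the classifier or projectors.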

Problem

Research questions and friction points this paper is trying to address.

Class-Incremental Learning

CLIP

Patch-level Features

Catastrophic Forgetting

Vision-Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Patch-level Alignment

CLIP-based Class-Incremental Learning

Semantic-guided Feature Selection