🤖 AI Summary
Existing CLIP-based class-incremental learning methods rely on a single handcrafted prompt template (e.g., “a photo of a [CLASS]”) and match only the final-layer visual features, neglecting the inherent coarse-to-fine semantic hierarchy of visual concepts. This limits semantic discriminability and aggravates catastrophic forgetting. To address this, the authors propose HERMAN (HiErarchical Representation MAtchiNg), which uses large language models to recursively generate discriminative multi-level textual descriptors and aligns them with CLIP’s multi-layer visual representations. A task-adaptive routing mechanism further matches descriptors to the appropriate level of the semantic hierarchy, enabling fine-grained, layer-specific feature alignment. Evaluated on multiple standard continual learning benchmarks, HERMAN consistently achieves state-of-the-art classification accuracy and stability.
📝 Abstract
Class-Incremental Learning (CIL) aims to endow models with the ability to continuously adapt to evolving data streams. Recent advances in pre-trained vision-language models (e.g., CLIP) provide a powerful foundation for this task. However, existing approaches often rely on simplistic templates, such as "a photo of a [CLASS]", which overlook the hierarchical nature of visual concepts. For example, recognizing "cat" versus "car" depends on coarse-grained cues, while distinguishing "cat" from "lion" requires fine-grained details. Likewise, CLIP's feature matching relies solely on the representation from the last layer, neglecting the hierarchical information contained in earlier layers. In this work, we introduce HiErarchical Representation MAtchiNg (HERMAN) for CLIP-based CIL. Our approach leverages LLMs to recursively generate discriminative textual descriptors, thereby augmenting the semantic space with explicit hierarchical cues. These descriptors are matched to different levels of the semantic hierarchy and adaptively routed based on task-specific requirements, enabling precise discrimination while alleviating catastrophic forgetting in incremental tasks. Extensive experiments on multiple benchmarks demonstrate that our method consistently achieves state-of-the-art performance.
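The matching scheme described above — per-level textual descriptors scored against per-layer visual features, with adaptive routing weights over the layers — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function and parameter names (`hierarchical_match`, `routing_logits`) are illustrative assumptions, the routing is reduced to a simple softmax over layers, and real CLIP encoders are replaced by plain feature vectors.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between each row of a and each row of b.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hierarchical_match(visual_feats, text_feats_per_level, routing_logits):
    """Score classes by matching each visual layer to its semantic level.

    visual_feats:         list of L arrays, shape (d,)    -- one per visual layer
    text_feats_per_level: list of L arrays, shape (C, d)  -- C class descriptors
                          at each semantic level (coarse -> fine)
    routing_logits:       array of shape (L,)             -- task-adaptive routing
    Returns an array of shape (C,) with aggregated class scores.
    """
    weights = softmax(routing_logits)  # soft routing over hierarchy levels
    scores = np.zeros(text_feats_per_level[0].shape[0])
    for w, v, t in zip(weights, visual_feats, text_feats_per_level):
        scores += w * cosine_sim(v[None, :], t)[0]
    return scores

# Toy usage: class 0's descriptors align with the visual features at every level.
rng = np.random.default_rng(0)
d, C, L = 8, 3, 2
visual = [rng.normal(size=d) for _ in range(L)]
texts = []
for layer in range(L):
    t = rng.normal(size=(C, d))
    t[0] = visual[layer]  # perfectly aligned class
    texts.append(t)
scores = hierarchical_match(visual, texts, np.zeros(L))
print(scores.argmax())  # class 0 wins
```

With uniform routing logits the layers contribute equally; a learned, task-conditioned router would instead emphasize coarse layers for easy discriminations ("cat" vs. "car") and fine layers for hard ones ("cat" vs. "lion").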