CRISP: Contrastive Residual Injection and Semantic Prompting for Continual Video Instance Segmentation

📅 2025-08-14

📈 Citations: 0

✨ Influential: 0

career value

158K/year

🤖 AI Summary

Continual video instance segmentation faces three key challenges: catastrophic forgetting in class-incremental learning, instance confusion, and temporal inconsistency. To address these, we propose a unified framework integrating contrastive learning with residual semantic prompting. Specifically, we introduce an instance association loss to enforce inter-frame consistency; design an adaptive residual semantic prompt pool for class-aware, learnable feature enhancement; and incorporate cross-task prompt initialization with query-prompt matching. Notably, we are the first to embed contrastive learning into semantic consistency constraints to jointly balance plasticity and stability. Our method achieves state-of-the-art performance on YouTube-VIS 2019 and 2021, outperforming existing continual learning approaches by up to +4.2% mAP in long-term incremental settings. The source code is publicly available.

Technology Category

Application Category

📝 Abstract

Continual video instance segmentation demands both the plasticity to absorb new object categories and the stability to retain previously learned ones, all while preserving temporal consistency across frames. In this work, we introduce Contrastive Residual Injection and Semantic Prompting (CRISP), an earlier attempt tailored to address the instance-wise, category-wise, and task-wise confusion in continual video instance segmentation. For instance-wise learning, we model instance tracking and construct instance correlation loss, which emphasizes the correlation with the prior query space while strengthening the specificity of the current task query. For category-wise learning, we build an adaptive residual semantic prompt (ARSP) learning framework, which constructs a learnable semantic residual prompt pool generated by category text and uses an adjustive query-prompt matching mechanism to build a mapping relationship between the query of the current task and the semantic residual prompt. Meanwhile, a semantic consistency loss based on the contrastive learning is introduced to maintain semantic coherence between object queries and residual prompts during incremental training. For task-wise learning, to ensure the correlation at the inter-task level within the query space, we introduce a concise yet powerful initialization strategy for incremental prompts. Extensive experiments on YouTube-VIS-2019 and YouTube-VIS-2021 datasets demonstrate that CRISP significantly outperforms existing continual segmentation methods in the long-term continual video instance segmentation task, avoiding catastrophic forgetting and effectively improving segmentation and classification performance. The code is available at https://github.com/01upup10/CRISP.

Problem

Research questions and friction points this paper is trying to address.

Address instance-wise confusion in continual video segmentation

Resolve category-wise confusion via adaptive semantic prompts

Mitigate task-wise confusion with incremental prompt initialization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive Residual Injection for instance-wise learning

Adaptive Residual Semantic Prompt for category-wise learning

Initialization strategy for inter-task correlation

🔎 Similar Papers

Context-Aware Video Instance Segmentation