Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address catastrophic forgetting of previously learned classes that arises when video instance segmentation (VIS) models learn new categories incrementally, this paper proposes a hierarchical visual prompting framework. At the frame level, an orthogonal gradient correction (OGC) module mitigates parameter interference during optimization; at the video level, a context decoder models inter-frame structural relationships and global temporal dependencies, enabling knowledge retention at both granularities. Notably, this work is the first to jointly model task-specific frame-level and video-level prompts, explicitly decoupling and then integrating local appearance semantics with global motion semantics. Evaluated on standard benchmarks including YouTube-VIS, the method significantly outperforms existing continual VIS approaches, maintaining strong performance on old classes while efficiently adapting to new ones. The source code is publicly available.

📝 Abstract
Video instance segmentation (VIS) has gained significant attention for its capability in tracking and segmenting object instances across video frames. However, most of the existing VIS approaches unrealistically assume that the categories of object instances remain fixed over time. Moreover, they experience catastrophic forgetting of old classes when required to continuously learn object instances belonging to new categories. To resolve these challenges, we develop a novel Hierarchical Visual Prompt Learning (HVPL) model that overcomes catastrophic forgetting of previous categories from both frame-level and video-level perspectives. Specifically, to mitigate forgetting at the frame level, we devise a task-specific frame prompt and an orthogonal gradient correction (OGC) module. The OGC module helps the frame prompt encode task-specific global instance information for new classes in each individual frame by projecting its gradients onto the orthogonal feature space of old classes. Furthermore, to address forgetting at the video level, we design a task-specific video prompt and a video context decoder. This decoder first embeds structural inter-class relationships across frames into the frame prompt features, and then propagates task-specific global video contexts from the frame prompt features to the video prompt. Through rigorous comparisons, our HVPL model proves to be more effective than baseline approaches. The code is available at https://github.com/JiahuaDong/HVPL.
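The abstract describes the OGC module as projecting the frame prompt's gradients onto the orthogonal complement of the old classes' feature space so that updates for new classes do not interfere with old ones. A minimal sketch of that projection idea is below; the function name `ogc_project` and the exact shapes are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def ogc_project(grad, old_feature_basis):
    """Project a gradient onto the orthogonal complement of the
    old-class feature subspace (illustrative helper; the paper's
    exact OGC formulation may differ).

    grad: (d,) gradient vector for the frame prompt.
    old_feature_basis: (d, k) matrix whose columns form an
        orthonormal basis of the old classes' feature space.
    """
    M = old_feature_basis
    # Remove the gradient component lying inside the old-class
    # subspace: g' = g - M (M^T g)
    return grad - M @ (M.T @ grad)

# Toy check: after projection, the gradient has no component along
# any old-class basis direction, so updating the prompt along it
# leaves features spanned by the old classes untouched.
rng = np.random.default_rng(0)
d, k = 8, 3
M, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal basis
g = rng.standard_normal(d)
g_proj = ogc_project(g, M)
print(np.allclose(M.T @ g_proj, 0))  # components in old subspace vanish
```

This is the standard orthogonal-projection recipe used in gradient-projection continual learning; how the paper constructs the old-class basis (e.g. from stored features) is not specified in this summary.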
Problem

Research questions and friction points this paper is trying to address.

Overcoming catastrophic forgetting in video instance segmentation
Learning new object categories without losing old ones
Hierarchical approach to mitigating forgetting at both the frame and video levels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Visual Prompt Learning for continual learning
Orthogonal Gradient Correction to prevent forgetting
Video context decoder for inter-class relationships
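The abstract says the video context decoder propagates task-specific global video contexts from the frame prompt features to the video prompt. One natural reading is a cross-attention step in which the video prompt queries the per-frame prompt features; the sketch below shows that pattern only. The function names, shapes, and residual update are assumptions, not the paper's architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def video_context_decode(frame_prompts, video_prompt):
    """Illustrative cross-attention aggregation: the video-level
    prompt attends over frame-level prompt features to collect a
    global video context (a sketch, not the paper's decoder).

    frame_prompts: (T, d) one prompt feature per frame.
    video_prompt:  (d,) learnable video-level prompt.
    """
    d = video_prompt.shape[0]
    # Attention weights of the video prompt over the T frames.
    attn = softmax(frame_prompts @ video_prompt / np.sqrt(d))  # (T,)
    context = attn @ frame_prompts                             # (d,)
    return video_prompt + context  # residual update with the context

rng = np.random.default_rng(1)
T, d = 5, 4
fp = rng.standard_normal((T, d))
vp = rng.standard_normal(d)
out = video_context_decode(fp, vp)
print(out.shape)
```

In practice a transformer decoder layer (learned query/key/value projections, multi-head attention) would replace this single dot-product step; the sketch only conveys the frame-to-video information flow.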