Revisiting Continual Semantic Segmentation with Pre-trained Vision Models

📅 2025-08-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In continual semantic segmentation (CSS), catastrophic forgetting is commonly attributed to backbone fine-tuning; however, this work reveals that pre-trained vision models inherently exhibit strong forgetting resistance, and that classifier drift, rather than backbone adaptation, is the primary cause of forgetting. To address this, the authors propose DFT*, a simple yet effective framework that freezes both the pre-trained backbone and all previously learned classifiers, pre-allocates classifiers for future classes, and fine-tunes only the classifiers of newly introduced classes. By dispensing with conventional feature-space retraining, DFT* drastically reduces the trainable parameter count and computational overhead. Evaluated across eight CSS settings on Pascal VOC 2012 and ADE20K, DFT* consistently matches or outperforms 16 state-of-the-art methods, achieving substantial average mIoU gains while reducing trainable parameters by roughly 40% and training time by over 30%, empirically validating the "freeze-over-finetune" paradigm as a more effective and efficient approach to CSS.
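
The summary above hinges on a specific training recipe: keep the pre-trained backbone and all previously learned class heads frozen, pre-allocate heads for classes that will arrive later, and update only the head for the current step. The PyTorch sketch below illustrates that recipe under stated assumptions; it is not the authors' released code, and names such as FrozenBackboneSegmenter, classes_per_step, and set_incremental_step are illustrative.

```python
# Minimal sketch (an assumption, not the paper's implementation) of the "freeze-over-finetune"
# idea: freeze the pre-trained backbone and previously learned class heads, pre-allocate heads
# for all incremental steps, and train only the head of the current step.
import torch
import torch.nn as nn

class FrozenBackboneSegmenter(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, classes_per_step: list[int]):
        super().__init__()
        self.backbone = backbone  # pre-trained vision model, e.g. ResNet101 or Swin-B
        # Pre-allocate one 1x1-conv head per incremental step, including future steps.
        self.heads = nn.ModuleList(
            [nn.Conv2d(feat_dim, n_cls, kernel_size=1) for n_cls in classes_per_step]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)                          # dense features, (B, feat_dim, H, W)
        logits = [head(feats) for head in self.heads]     # one logit map per step
        return torch.cat(logits, dim=1)                   # per-pixel logits over all classes

    def set_incremental_step(self, step: int) -> list[nn.Parameter]:
        """Freeze the backbone and all heads except the one for the current step."""
        for p in self.parameters():
            p.requires_grad = False
        trainable = list(self.heads[step].parameters())
        for p in trainable:
            p.requires_grad = True
        return trainable  # hand only these parameters to the optimizer

# Usage sketch for a two-step split (e.g. 16 base classes including background, then 5 new ones):
# model = FrozenBackboneSegmenter(backbone, feat_dim=2048, classes_per_step=[16, 5])
# params = model.set_incremental_step(step=1)
# optimizer = torch.optim.SGD(params, lr=0.01)
```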

📝 Abstract
Continual Semantic Segmentation (CSS) seeks to incrementally learn to segment novel classes while preserving knowledge of previously encountered ones. Recent advancements in CSS have been largely driven by the adoption of Pre-trained Vision Models (PVMs) as backbones. Among existing strategies, Direct Fine-Tuning (DFT), which sequentially fine-tunes the model across classes, remains the most straightforward approach. Prior work often regards DFT as a performance lower bound due to its presumed vulnerability to severe catastrophic forgetting, leading to the development of numerous complex mitigation techniques. However, we contend that this prevailing assumption is flawed. In this paper, we systematically revisit forgetting in DFT across two standard benchmarks, Pascal VOC 2012 and ADE20K, under eight CSS settings using two representative PVM backbones: ResNet101 and Swin-B. Through a detailed probing analysis, our findings reveal that existing methods significantly underestimate the inherent anti-forgetting capabilities of PVMs. Even under DFT, PVMs retain previously learned knowledge with minimal forgetting. Further investigation of the feature space indicates that the observed forgetting primarily arises from the classifier's drift away from the PVM, rather than from degradation of the backbone representations. Based on this insight, we propose DFT*, a simple yet effective enhancement to DFT that incorporates strategies such as freezing the PVM backbone and previously learned classifiers, as well as pre-allocating future classifiers. Extensive experiments show that DFT* consistently achieves competitive or superior performance compared to sixteen state-of-the-art CSS methods, while requiring substantially fewer trainable parameters and less training time.
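
A probing analysis of this kind typically fits a lightweight classifier on frozen features: the fine-tuned backbone is held fixed and only a fresh head is re-fit on old-class data, so the quality of the backbone's representations can be measured independently of the drifted classifier. The sketch below shows one plausible way to run such a probe; the exact protocol used in the paper is an assumption here, and backbone, probe_loader, feat_dim, and num_old_classes are placeholders.

```python
# Hedged sketch of a linear-probe style analysis: if a re-fit head on frozen features recovers
# high mIoU for old classes, forgetting is attributable to classifier drift rather than to
# degraded backbone representations.
import torch
import torch.nn as nn

def linear_probe(backbone: nn.Module, probe_loader, feat_dim: int, num_old_classes: int,
                 epochs: int = 5, lr: float = 0.01, device: str = "cuda") -> nn.Module:
    backbone.eval().to(device)
    for p in backbone.parameters():
        p.requires_grad = False                        # backbone stays fixed during probing

    probe = nn.Conv2d(feat_dim, num_old_classes, kernel_size=1).to(device)
    optimizer = torch.optim.SGD(probe.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss(ignore_index=255)  # 255 marks ignored pixels (common convention)

    for _ in range(epochs):
        for images, masks in probe_loader:             # masks: per-pixel old-class labels (long)
            images, masks = images.to(device), masks.to(device)
            with torch.no_grad():
                feats = backbone(images)               # (B, feat_dim, h, w)
            logits = probe(feats)
            logits = nn.functional.interpolate(logits, size=masks.shape[-2:],
                                               mode="bilinear", align_corners=False)
            loss = criterion(logits, masks)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return probe  # evaluate this probe's old-class mIoU against the drifted classifier's
```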
Problem

Research questions and friction points this paper is trying to address.

Reassessing catastrophic forgetting in continual semantic segmentation
Evaluating pre-trained vision models' anti-forgetting capabilities
Proposing enhanced direct fine-tuning to mitigate classifier drift
Innovation

Methods, ideas, or system contributions that make the work stand out.

Revisiting Direct Fine-Tuning for CSS
Freezing the PVM backbone and previously learned classifiers
Pre-allocating future classifiers
Duzhen Zhang
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing, Multimodal, Large Language Models, Continual Learning, AI4Science
Yong Ren
Institute of Automation, Chinese Academy of Sciences
Speech Codec, Text-to-speech, Video-to-audio, MLLM, Continual Learning
Wei Cong
Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, China
Junhao Zheng
South China University of Technology, Qwen Team
Large Language Models, Pretraining, Continual Learning
Qiaoyi Su
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Shuncheng Jia
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Zhong-Zhi Li
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Xuanle Zhao
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Ye Bai
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Feilong Chen
Huawei Inc.; previously at CASIA
(Native) Multimodal LLM, Multimodal Generation, Multimodal Reasoning, Omni-modal LLM
Qi Tian
Huawei Inc., Shenzhen, Guangdong, China
Tielin Zhang
Chinese Academy of Sciences
Spiking Neural Networks, Cognitive Computation, Computational Neuroscience