🤖 AI Summary
This work addresses a challenge in video semantic segmentation: state space models (SSMs) often lose pixel-level spatiotemporal details due to fixed-size state compression, leading to inconsistent predictions and reduced accuracy. To mitigate this issue, the authors propose RS-SSM, a novel approach that incorporates a Channel-wise Amplitude Perceptron (CwAP) to capture the distribution of specific (forgotten) feature information and a Forgetting Gate Information Refiner (FGIR) to adaptively invert and refine the forgetting-gate matrix, thereby recovering spatiotemporal information lost during state compression. Extensive experiments show that RS-SSM achieves state-of-the-art performance on four mainstream video semantic segmentation benchmarks while maintaining high computational efficiency, improving both pixel-level accuracy and temporal consistency.
📝 Abstract
Recently, state space models have demonstrated efficient video segmentation through linear-complexity state space compression. However, Video Semantic Segmentation (VSS) requires pixel-level spatiotemporal modeling to maintain temporally consistent segmentation of semantic objects. While state space models can preserve common semantic information during state space compression, the fixed-size state space inevitably forgets specific information, which limits their capability for pixel-level segmentation. To tackle this issue, we propose a Refining Specifics State Space Model (RS-SSM) for video semantic segmentation, which performs complementary refinement of forgotten spatiotemporal specifics. Specifically, a Channel-wise Amplitude Perceptron (CwAP) is designed to extract and align the distribution characteristics of specific information in the state space. In addition, a Forgetting Gate Information Refiner (FGIR) is proposed to adaptively invert and refine the forgetting-gate matrix of the state space model based on this distribution. Consequently, RS-SSM leverages the inverted forgetting gate to complementarily recover the specific information forgotten during state space compression, enhancing the model's capability for spatiotemporal pixel-level segmentation. Extensive experiments on four VSS benchmarks demonstrate that RS-SSM achieves state-of-the-art performance while maintaining high computational efficiency. The code is available at https://github.com/zhoujiahuan1991/CVPR2026-RS-SSM.
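The core idea — re-injecting what the forgetting gate discards — can be illustrated with a toy diagonal SSM scan. This is a minimal sketch, not the paper's actual method: the abstract gives no equations, so the recurrence `h_t = A_t * h_{t-1} + B_t * x_t`, the scalar `gate`, and the function `ssm_scan` are all hypothetical stand-ins; in RS-SSM the refinement weights would be produced by the learned CwAP/FGIR modules rather than a fixed constant.

```python
def ssm_scan(x, A, B, refine=False, gate=0.5):
    """Toy diagonal SSM recurrence: h_t = A_t * h_{t-1} + B_t * x_t.

    A[t] is the per-channel forgetting gate (fraction of state kept).
    With refine=True, the portion (1 - A[t]) * h_{t-1} that the gate
    would discard is partially re-injected via `gate` -- a crude
    stand-in for an inverted-forgetting-gate refinement step.
    """
    h = [0.0] * len(x[0])
    outs = []
    for t in range(len(x)):
        # Specifics the forgetting gate is about to drop from the state.
        forgotten = [(1.0 - a) * hv for a, hv in zip(A[t], h)]
        # Standard state update: decay old state, add gated input.
        h = [a * hv + b * xv for a, hv, b, xv in zip(A[t], h, B[t], x[t])]
        if refine:
            # Complementary refinement: add back part of the dropped state.
            h = [hv + gate * f for hv, f in zip(h, forgotten)]
        outs.append(list(h))
    return outs

x = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]   # inputs, 3 steps x 2 channels
A = [[0.5, 0.8]] * 3                        # forgetting gate per channel
B = [[1.0, 1.0]] * 3                        # input gate

base = ssm_scan(x, A, B, refine=False)
refined = ssm_scan(x, A, B, refine=True)
```

Since the initial state is zero, the first step is identical for both scans; from the second step on, the refined state retains strictly more of the earlier specifics than the plain scan.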