State Space Prompting via Gathering and Spreading Spatio-Temporal Information for Video Understanding

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video prompt learning methods applied to pretrained state space models (SSMs) suffer from sequential compression mechanisms that hinder effective modeling of intra-frame spatial and inter-frame temporal contexts, resulting in insufficient discriminative spatiotemporal feature extraction. To address this, we propose State Space Prompting (SSP), a novel prompting framework featuring two complementary modules: Intra-Frame Gathering, which adaptively aggregates salient spatial structures within each frame, and Inter-Frame Spreading, which propagates temporal dependencies across frames. These modules jointly enable spatiotemporal co-modeling within lightweight, learnable prompt tokens. Crucially, SSP only fine-tunes the prompt parameters—leaving the frozen SSM backbone intact—thereby drastically reducing computational overhead. Extensive experiments demonstrate that SSP achieves an average 2.76% improvement over prior state-of-the-art methods across four mainstream video classification benchmarks, validating its efficiency, effectiveness, and strong generalization capability.

📝 Abstract
Recently, pre-trained state space models (SSMs) have shown great potential for video classification: they sequentially compress visual tokens with linear complexity, improving the processing efficiency of video data while maintaining high performance. To adapt powerful pre-trained models to downstream tasks, prompt learning achieves efficient adaptation with only a small number of fine-tuned parameters. However, sequentially compressed visual prompt tokens fail to capture the spatial and temporal contextual information in a video, which limits both the propagation of spatial information within a frame and of temporal information across frames in the state compression model, and the extraction of discriminative information. To tackle this issue, we propose a State Space Prompting (SSP) method for video understanding, which combines intra-frame and inter-frame prompts to aggregate and propagate key spatio-temporal information in the video. Specifically, an Intra-Frame Gathering (IFG) module aggregates key spatial information within each frame, and an Inter-Frame Spreading (IFS) module spreads discriminative spatio-temporal information across frames. By adaptively balancing and compressing key spatio-temporal information within and between frames, SSP propagates discriminative information through the video in a complementary manner. Extensive experiments on four video benchmark datasets verify that SSP outperforms existing SOTA methods by 2.76% on average while reducing the fine-tuning parameter overhead.
Problem

Research questions and friction points this paper is trying to address.

Capturing spatiotemporal context in video prompts
Propagating spatial and temporal information effectively
Improving discriminative information extraction in state space models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Intra-Frame Gathering module aggregates spatial key information
Inter-Frame Spreading module propagates spatio-temporal information across frames
State Space Prompting balances and compresses key spatio-temporal information
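The two-stage gather-then-spread idea above can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the authors' implementation: `intra_frame_gather` stands in for IFG (saliency-weighted pooling of spatial tokens into one prompt per frame), and `inter_frame_spread` stands in for IFS (an SSM-style exponential recurrence that mixes prompts across frames). All function names, shapes, and the `decay` parameter are illustrative choices.

```python
import numpy as np

def intra_frame_gather(frames, w_score):
    """IFG-style pooling (sketch): aggregate salient spatial tokens of each
    frame into a single prompt token via softmax-weighted pooling.
    frames: (T, N, D) video tokens; w_score: (D,) learned scoring vector."""
    scores = frames @ w_score                         # (T, N) token saliency
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # softmax over N tokens
    return (weights[..., None] * frames).sum(axis=1)  # (T, D) frame prompts

def inter_frame_spread(prompts, decay=0.9):
    """IFS-style propagation (sketch): spread prompt information forward
    across frames with a simple linear recurrence, mimicking sequential
    state compression in an SSM."""
    state = np.zeros(prompts.shape[1])
    out = np.empty_like(prompts)
    for t, p in enumerate(prompts):
        state = decay * state + (1.0 - decay) * p     # compress history
        out[t] = state                                # temporally mixed prompt
    return out                                        # (T, D)

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 16, 32))   # 8 frames, 16 tokens, dim 32
prompts = intra_frame_gather(frames, rng.standard_normal(32))
mixed = inter_frame_spread(prompts)
print(prompts.shape, mixed.shape)           # (8, 32) (8, 32)
```

In the real method these prompts are learnable and are the only fine-tuned parameters; the frozen SSM backbone consumes them alongside the visual tokens.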
🔎 Similar Papers
No similar papers found.