State Space Prompting via Gathering and Spreading Spatio-Temporal Information for Video Understanding

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video prompt learning methods applied to pretrained state space models (SSMs) suffer from sequential compression mechanisms that hinder effective modeling of intra-frame spatial and inter-frame temporal contexts, resulting in insufficient discriminative spatiotemporal feature extraction. To address this, we propose State Space Prompting (SSP), a novel prompting framework featuring two complementary modules: Intra-Frame Gathering, which adaptively aggregates salient spatial structures within each frame, and Inter-Frame Spreading, which propagates temporal dependencies across frames. These modules jointly enable spatiotemporal co-modeling within lightweight, learnable prompt tokens. Crucially, SSP only fine-tunes the prompt parameters—leaving the frozen SSM backbone intact—thereby drastically reducing computational overhead. Extensive experiments demonstrate that SSP achieves an average 2.76% improvement over prior state-of-the-art methods across four mainstream video classification benchmarks, validating its efficiency, effectiveness, and strong generalization capability.

📝 Abstract
Recently, pre-trained state space models (SSMs) have shown great potential for video classification: they sequentially compress visual tokens with linear complexity, improving the processing efficiency of video data while maintaining high performance. To adapt powerful pre-trained models to downstream tasks, prompt learning achieves efficient adaptation with only a small number of fine-tuned parameters. However, sequentially compressed visual prompt tokens fail to capture the spatial and temporal contextual information in a video, which limits both the propagation of spatial information within a frame and of temporal information across frames in the state compression model, and the extraction of discriminative information. To tackle this issue, we propose a State Space Prompting (SSP) method for video understanding, which combines intra-frame and inter-frame prompts to aggregate and propagate key spatio-temporal information in the video. Specifically, an Intra-Frame Gathering (IFG) module aggregates key spatial information within each frame, and an Inter-Frame Spreading (IFS) module spreads discriminative spatio-temporal information across frames. By adaptively balancing and compressing key spatio-temporal information within and between frames, SSP propagates discriminative information through the video in a complementary manner. Extensive experiments on four video benchmark datasets verify that SSP outperforms existing SOTA methods by 2.76% on average while reducing the fine-tuning parameter overhead.
Problem

Research questions and friction points this paper is trying to address.

Capturing spatiotemporal context in video prompts
Propagating spatial and temporal information effectively
Improving discriminative information extraction in state space models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Intra-Frame Gathering module aggregates spatial key information
Inter-Frame Spreading module propagates spatio-temporal information across frames
State Space Prompting balances and compresses key spatio-temporal information
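The two-stage gather-then-spread idea above can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the authors' implementation: `intra_frame_gather` stands in for IFG (saliency-weighted pooling of spatial tokens into one prompt per frame), and `inter_frame_spread` stands in for IFS (an SSM-style exponential recurrence that mixes prompts across frames). All function names, shapes, and the `decay` parameter are illustrative choices.

```python
import numpy as np

def intra_frame_gather(frames, w_score):
    """IFG-style pooling (sketch): aggregate salient spatial tokens of each
    frame into a single prompt token via softmax-weighted pooling.
    frames: (T, N, D) video tokens; w_score: (D,) learned scoring vector."""
    scores = frames @ w_score                         # (T, N) token saliency
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # softmax over N tokens
    return (weights[..., None] * frames).sum(axis=1)  # (T, D) frame prompts

def inter_frame_spread(prompts, decay=0.9):
    """IFS-style propagation (sketch): spread prompt information forward
    across frames with a simple linear recurrence, mimicking sequential
    state compression in an SSM."""
    state = np.zeros(prompts.shape[1])
    out = np.empty_like(prompts)
    for t, p in enumerate(prompts):
        state = decay * state + (1.0 - decay) * p     # compress history
        out[t] = state                                # temporally mixed prompt
    return out                                        # (T, D)

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 16, 32))   # 8 frames, 16 tokens, dim 32
prompts = intra_frame_gather(frames, rng.standard_normal(32))
mixed = inter_frame_spread(prompts)
print(prompts.shape, mixed.shape)           # (8, 32) (8, 32)
```

In the real method these prompts are learnable and are the only fine-tuned parameters; the frozen SSM backbone consumes them alongside the visual tokens.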
🔎 Similar Papers
No similar papers found.