Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing video–language pretraining methods, which often lose critical visual information under high masking ratios and suffer from temporal information leakage due to inter-frame redundancy. To overcome these issues, the authors propose ClusterSTM, a novel approach that clusters visual tokens within each frame into semantically independent groups and retains only the temporally densest token from each cluster for masked modeling. Additionally, a video–text correlation reconstruction objective is introduced to align high-level multimodal semantics. ClusterSTM establishes the first cluster-level spatiotemporal masking mechanism, enhancing temporal consistency while preserving video content integrity—surpassing conventional pixel-level reconstruction. Experimental results demonstrate state-of-the-art performance across multiple benchmarks, including video–text retrieval, video question answering, and video captioning, setting a new standard for efficient video–language pretraining.

📝 Abstract
Large-scale video-language pretraining enables strong generalization across multimodal tasks but often incurs prohibitive computational costs. Although recent advances in masked visual modeling help mitigate this issue, they still suffer from two fundamental limitations: severe visual information loss under high masking ratios and temporal information leakage caused by inter-frame correlations. To address these challenges, we propose ClusterSTM, a Cluster-Wise Spatio-Temporal Masking strategy for efficient video-language pretraining. ClusterSTM first performs intra-frame clustering to partition visual tokens into multiple semantically independent clusters, then conducts cluster-wise masking by retaining the token with the highest temporal density within each cluster. Our masking strategy ensures that the retained tokens capture holistic video content while exhibiting strong temporal correlation. Additionally, we introduce a video-text relevance reconstruction objective that aligns high-level multimodal semantics beyond conventional visual reconstruction. Extensive experiments across multiple benchmarks demonstrate that ClusterSTM achieves superior performance on video-text retrieval, video question answering, and video captioning tasks, establishing a new state-of-the-art among efficient video-language models.
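The masking pipeline described in the abstract (intra-frame clustering, then keeping the most temporally dense token per cluster) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the clustering method (a small k-means here), the "temporal density" proxy (mean cosine similarity to tokens at the same spatial position in adjacent frames), and the function names `kmeans` and `cluster_stm_mask` are all assumptions for illustration.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Minimal k-means over token embeddings x: [N, D] -> labels [N].
    Stand-in for whatever clustering the paper actually uses."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            pts = x[labels == c]
            if len(pts):
                centers[c] = pts.mean(axis=0)
    return labels

def cluster_stm_mask(tokens, k=4):
    """Cluster-wise spatio-temporal masking sketch.
    tokens: [T, N, D] frame-wise visual token embeddings.
    Returns a boolean keep-mask [T, N]; True = token retained,
    all other tokens would be masked out for masked modeling."""
    T, N, D = tokens.shape
    keep = np.zeros((T, N), dtype=bool)
    for t in range(T):
        # 1) intra-frame clustering into semantically independent groups
        labels = kmeans(tokens[t], k, seed=t)
        # 2) temporal-density proxy (assumption): mean cosine similarity
        #    to the same token position in adjacent frames
        neighbors = [tokens[t - 1]] if t > 0 else []
        if t < T - 1:
            neighbors.append(tokens[t + 1])
        nb = np.stack(neighbors)                       # [M, N, D]
        cur = tokens[t]                                # [N, D]
        sim = np.einsum('nd,mnd->mn', cur, nb)
        sim /= (np.linalg.norm(cur, axis=-1) *
                np.linalg.norm(nb, axis=-1) + 1e-8)
        density = sim.mean(axis=0)                     # [N]
        # 3) retain the temporally densest token from each cluster
        for c in range(k):
            idx = np.where(labels == c)[0]
            if len(idx):
                keep[t, idx[density[idx].argmax()]] = True
    return keep
```

With k clusters per frame, at most k tokens per frame survive, so the effective masking ratio is roughly 1 - k/N while every semantic group still contributes one representative token.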
Problem

Research questions and friction points this paper is trying to address.

video-language pretraining
computational cost
visual information loss
temporal information leakage
masked visual modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cluster-wise masking
Spatio-temporal modeling
Video-language pretraining
Temporal correlation
Multimodal semantics