Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing video–language pretraining methods, which often lose critical visual information under high masking ratios and suffer from temporal information leakage due to inter-frame redundancy. To overcome these issues, the authors propose ClusterSTM, a novel approach that clusters visual tokens within each frame into semantically independent groups and retains only the temporally densest token from each cluster for masked modeling. Additionally, a video–text correlation reconstruction objective is introduced to align high-level multimodal semantics. ClusterSTM establishes the first cluster-level spatiotemporal masking mechanism, enhancing temporal consistency while preserving video content integrity—surpassing conventional pixel-level reconstruction. Experimental results demonstrate state-of-the-art performance across multiple benchmarks, including video–text retrieval, video question answering, and video captioning, setting a new standard for efficient video–language pretraining.

📝 Abstract
Large-scale video-language pretraining enables strong generalization across multimodal tasks but often incurs prohibitive computational costs. Although recent advances in masked visual modeling help mitigate this issue, they still suffer from two fundamental limitations: severe visual information loss under high masking ratios and temporal information leakage caused by inter-frame correlations. To address these challenges, we propose ClusterSTM, a Cluster-Wise Spatio-Temporal Masking strategy for efficient video-language pretraining. ClusterSTM first performs intra-frame clustering to partition visual tokens into multiple semantically independent clusters, then conducts cluster-wise masking by retaining the token with the highest temporal density within each cluster. Our masking strategy ensures that the retained tokens capture holistic video content while exhibiting strong temporal correlation. Additionally, we introduce a video-text relevance reconstruction objective that aligns high-level multimodal semantics beyond conventional visual reconstruction. Extensive experiments across multiple benchmarks demonstrate that ClusterSTM achieves superior performance on video-text retrieval, video question answering, and video captioning tasks, establishing a new state-of-the-art among efficient video-language models.
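The masking pipeline described in the abstract (intra-frame clustering, then keeping the most temporally dense token per cluster) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the clustering method (a small k-means here), the "temporal density" proxy (mean cosine similarity to tokens at the same spatial position in adjacent frames), and the function names `kmeans` and `cluster_stm_mask` are all assumptions for illustration.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Minimal k-means over token embeddings x: [N, D] -> labels [N].
    Stand-in for whatever clustering the paper actually uses."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            pts = x[labels == c]
            if len(pts):
                centers[c] = pts.mean(axis=0)
    return labels

def cluster_stm_mask(tokens, k=4):
    """Cluster-wise spatio-temporal masking sketch.
    tokens: [T, N, D] frame-wise visual token embeddings.
    Returns a boolean keep-mask [T, N]; True = token retained,
    all other tokens would be masked out for masked modeling."""
    T, N, D = tokens.shape
    keep = np.zeros((T, N), dtype=bool)
    for t in range(T):
        # 1) intra-frame clustering into semantically independent groups
        labels = kmeans(tokens[t], k, seed=t)
        # 2) temporal-density proxy (assumption): mean cosine similarity
        #    to the same token position in adjacent frames
        neighbors = [tokens[t - 1]] if t > 0 else []
        if t < T - 1:
            neighbors.append(tokens[t + 1])
        nb = np.stack(neighbors)                       # [M, N, D]
        cur = tokens[t]                                # [N, D]
        sim = np.einsum('nd,mnd->mn', cur, nb)
        sim /= (np.linalg.norm(cur, axis=-1) *
                np.linalg.norm(nb, axis=-1) + 1e-8)
        density = sim.mean(axis=0)                     # [N]
        # 3) retain the temporally densest token from each cluster
        for c in range(k):
            idx = np.where(labels == c)[0]
            if len(idx):
                keep[t, idx[density[idx].argmax()]] = True
    return keep
```

With k clusters per frame, at most k tokens per frame survive, so the effective masking ratio is roughly 1 - k/N while every semantic group still contributes one representative token.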
Problem

Research questions and friction points this paper is trying to address.

video-language pretraining
computational cost
visual information loss
temporal information leakage
masked visual modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cluster-wise masking
Spatio-temporal modeling
Video-language pretraining
Temporal correlation
Multimodal semantics