🤖 AI Summary
This work addresses two limitations of existing micro-video popularity prediction methods: restricted temporal receptive fields and inefficient use of historical correlations. To overcome them, the authors propose a unified framework for joint spatio-temporal enlargement. The framework incorporates a temporal enlargement module, driven by frame scoring, that combines sparse sampling with dense perception to improve long-sequence understanding. It also introduces a topology-aware memory bank that hierarchically clusters historically relevant content, enabling efficient and scalable modeling of historical information without unbounded memory growth. Evaluated on three widely used benchmarks, the proposed method consistently outperforms eleven strong baselines, with improvements in both prediction accuracy and ranking consistency.
📝 Abstract
Micro-video popularity prediction (MVPP) aims to forecast the future popularity of videos on online media, which is essential for applications such as content recommendation and traffic allocation. In real-world scenarios, it is critical for MVPP approaches to understand both the temporal dynamics of a given video (temporal) and its historical relevance to other videos (spatial). However, existing approaches suffer from limitations in both dimensions: temporally, they rely on sparse short-range sampling that restricts content perception; spatially, they depend on flat retrieval memory with limited capacity and low efficiency, hindering scalable knowledge utilization. To overcome these limitations, we propose a unified framework that achieves joint spatio-temporal enlargement, enabling precise perception of extremely long video sequences while supporting a scalable memory bank that can expand without limit to incorporate all relevant historical videos. Technically, for Temporal Enlargement, we employ a frame scoring module that extracts highlight cues from video frames through two complementary pathways: sparse sampling and dense perception. Their outputs are adaptively fused to enable robust long-sequence content understanding. For Spatial Enlargement, we construct a Topology-Aware Memory Bank that hierarchically clusters historically relevant content based on topological relationships. Instead of directly expanding memory capacity, we update the encoder features of the corresponding clusters when incorporating new videos, enabling unbounded historical association without unbounded storage growth. Extensive experiments on three widely used MVPP benchmarks demonstrate that our method consistently outperforms 11 strong baselines across mainstream metrics, achieving robust improvements in both prediction accuracy and ranking consistency.
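To make the two mechanisms described in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It rests on several assumptions that the abstract does not specify: the function `temporal_enlarge`, the class `ClusterMemory`, the top-k selection rule, the softmax weighting, the scalar `gate`, and the running-mean update are all hypothetical stand-ins. In particular, the memory here is a flat set of centroids with nearest-centroid assignment, which simplifies away the hierarchical, topology-based clustering the paper describes.

```python
import numpy as np


def temporal_enlarge(frames, scores, k, gate):
    """Fuse a sparse and a dense summary of frame features (illustrative only).

    frames: (T, D) array of per-frame features.
    scores: (T,) array of frame scores (assumed to come from a scoring module).
    k:      number of top-scoring frames kept by the sparse pathway.
    gate:   scalar in [0, 1]; a fixed stand-in for learned adaptive fusion.
    """
    frames = np.asarray(frames, dtype=float)
    scores = np.asarray(scores, dtype=float)

    # Sparse pathway: average the k highest-scoring frames.
    top_idx = np.argsort(scores)[-k:]
    sparse = frames[top_idx].mean(axis=0)

    # Dense pathway: softmax-weighted average over every frame.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    dense = weights @ frames

    # Fusion: convex combination of the two pathway outputs.
    return gate * sparse + (1.0 - gate) * dense


class ClusterMemory:
    """Fixed-size memory: new videos update a cluster instead of adding a slot.

    Assumption: flat nearest-centroid assignment with a running-mean update,
    used here in place of the paper's hierarchical topology-aware clustering.
    """

    def __init__(self, centroids):
        self.centroids = np.asarray(centroids, dtype=float).copy()
        self.counts = np.ones(len(self.centroids))

    def add(self, feature):
        feature = np.asarray(feature, dtype=float)
        # Assign the new video to its nearest cluster (Euclidean distance).
        dists = np.linalg.norm(self.centroids - feature, axis=1)
        idx = int(np.argmin(dists))
        # Running-mean update: storage stays at (num_clusters, D).
        self.counts[idx] += 1
        self.centroids[idx] += (feature - self.centroids[idx]) / self.counts[idx]
        return idx
```

The point of the second half is the storage behavior: however many videos are passed to `add`, the `centroids` array keeps its original shape, which mirrors the abstract's claim of unbounded historical association without unbounded storage growth.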