🤖 AI Summary
This work addresses key limitations of existing video summarization methods: reliance on expensive manual annotations, poor cross-domain generalization, and high computational cost from complex architectures. It also targets the gap that unsupervised and weakly supervised approaches show relative to supervised ones in modeling long-range temporal dependencies and semantic structure. To this end, we propose TRIMMER, a two-stage framework that first learns video representations through self-supervised learning and then makes spatio-temporal summary decisions through reinforcement learning. TRIMMER replaces conventional similarity-based objectives with information-theoretic rewards: entropy-based metrics capture higher-order temporal dynamics and semantic diversity, and rewards are computed directly over the selected frame indices to improve computational efficiency. Experiments on standard benchmarks show that TRIMMER achieves state-of-the-art performance among unsupervised and self-supervised methods while remaining competitive with leading supervised approaches.
📝 Abstract
The rapid growth of video content across domains such as surveillance, education, and social media has made efficient content understanding increasingly critical. Video summarization addresses this challenge by generating concise yet semantically meaningful representations, but existing approaches often rely on expensive manual annotations, struggle to generalize across domains, and incur significant computational costs due to complex architectures. Moreover, unsupervised and weakly supervised methods typically underperform compared to supervised counterparts in capturing long-range temporal dependencies and semantic structure. In this work, we propose TRIMMER (Temporal Relative Information Maximization for Multi-objective Efficient Reinforcement), a novel self-supervised reinforcement learning framework for video summarization. TRIMMER operates in two stages: it first learns robust representations via self-supervised learning and then performs spatio-temporal decision making through reinforcement learning guided by information-theoretic reward functions. Unlike prior approaches that rely on similarity-based objectives, our method introduces entropy-based metrics to capture higher-order temporal dynamics and semantic diversity, while computing rewards directly over selected frame indices to improve computational efficiency. Extensive experiments on standard benchmarks demonstrate that TRIMMER achieves state-of-the-art performance among unsupervised and self-supervised methods, while remaining competitive with leading supervised approaches, highlighting its effectiveness for scalable and generalizable video summarization.
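To make the idea of an entropy-based reward computed over selected frame indices more concrete, the following is a minimal sketch and not the reward actually used by TRIMMER, which the abstract does not specify. It assumes each frame already carries a discrete semantic label (for example, a cluster ID derived from self-supervised features) and scores a selection by two normalized Shannon entropies: one over the labels of the selected frames (semantic diversity) and one over how the selected indices spread across equal-length temporal segments (temporal coverage). The function names, the `alpha` weighting, and the segment-based coverage term are all hypothetical choices made for illustration.

```python
import math
from collections import Counter
from typing import Sequence


def shannon_entropy(counts: Sequence[int]) -> float:
    """Shannon entropy (in nats) of the distribution given by non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def normalized_entropy(counts: Sequence[int], num_bins: int) -> float:
    """Entropy divided by its maximum log(num_bins), giving a value in [0, 1]."""
    if num_bins <= 1:
        return 0.0
    return shannon_entropy(counts) / math.log(num_bins)


def entropy_reward(
    selected: Sequence[int],
    frame_labels: Sequence[int],
    num_labels: int,
    num_segments: int,
    alpha: float = 0.5,
) -> float:
    """Hypothetical reward for a set of selected frame indices.

    ASSUMPTION: this is an illustrative stand-in, not the paper's formula.
    Only the selected indices are visited, so the cost grows with the
    summary length rather than with the full video length.
    """
    if not selected:
        return 0.0
    num_frames = len(frame_labels)

    # Semantic diversity: entropy of the labels among the selected frames.
    label_counts = Counter(frame_labels[i] for i in selected)
    diversity = normalized_entropy(list(label_counts.values()), num_labels)

    # Temporal coverage: entropy of the selected indices over equal-length segments.
    segment_counts = Counter(i * num_segments // num_frames for i in selected)
    coverage = normalized_entropy(list(segment_counts.values()), num_segments)

    return alpha * diversity + (1.0 - alpha) * coverage
```

Under these assumptions, a selection whose frames share one label and sit in one segment scores near zero, while a selection spread evenly across labels and segments scores near one. In a reinforcement learning setup, a scalar of this kind could serve as the episode reward for a frame-selection policy.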