SMC++: Masked Learning of Unsupervised Video Semantic Compression

📅 2024-06-07
🏛️ arXiv.org
📈 Citations: 8
Influential: 1
🤖 AI Summary
Traditional video compression methods prioritize human visual perception over semantic fidelity, degrading downstream analysis performance. To address this, we propose an end-to-end semantic-preserving video compression framework. Methodologically: (1) we introduce self-supervised masked video modeling (MVM) to jointly learn spatiotemporal semantics; (2) we propose a novel semantic entropy regularization mechanism to suppress non-semantic noise; (3) we design a masked motion prediction objective to strengthen temporal semantic modeling; and (4) we construct a compact blueprint semantic representation to align heterogeneous features and fully exploit Transformer capacity. Extensive experiments across three downstream tasks—object detection, action recognition, and tracking—on seven benchmark datasets demonstrate that our method significantly outperforms conventional, learned, and perception-optimized codecs, achieving synergistic improvements in both compression efficiency and semantic fidelity.
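The core self-supervised objective described above — predicting masked patch tokens and scoring the model only on the masked positions — can be sketched in a few lines of plain Python. This is an illustrative toy, not the paper's implementation: `predictor` stands in for the Transformer, tokens are scalar stand-ins for patch embeddings, and the masking scheme is a hypothetical simplification.

```python
import random

def masked_prediction_loss(tokens, predictor, mask_ratio=0.75, seed=0):
    """MVM-style objective: MSE computed only on masked positions.

    tokens: list of floats standing in for patch embeddings.
    predictor: callable mapping the visible (masked-out) sequence to a
    full-length reconstruction -- hypothetical stand-in for a Transformer.
    """
    rng = random.Random(seed)
    n = len(tokens)
    masked = set(rng.sample(range(n), int(n * mask_ratio)))
    # Zero out masked positions; the predictor sees only visible tokens.
    visible = [0.0 if i in masked else t for i, t in enumerate(tokens)]
    recon = predictor(visible)
    # The loss is taken only over masked positions, as in masked modeling.
    errs = [(recon[i] - tokens[i]) ** 2 for i in masked]
    return sum(errs) / len(errs)
```

With an identity predictor (which cannot recover the masked values), the loss equals the mean squared magnitude of the masked tokens, illustrating that only the hidden positions contribute to the objective.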

📝 Abstract
Most video compression methods focus on human visual perception, neglecting semantic preservation. This leads to severe semantic loss during compression, hampering downstream video analysis tasks. In this paper, we propose a Masked Video Modeling (MVM)-powered compression framework that particularly preserves video semantics, by jointly mining and compressing the semantics in a self-supervised manner. While MVM is proficient at learning generalizable semantics through the masked patch prediction task, it may also encode non-semantic information such as trivial textural details, wasting bit cost and introducing semantic noise. To suppress this, we explicitly regularize the non-semantic entropy of the compressed video in the MVM token space. The proposed framework is instantiated as a simple Semantic-Mining-then-Compression (SMC) model. Furthermore, we extend SMC to an advanced SMC++ model in several aspects. First, we equip it with a masked motion prediction objective, leading to better temporal semantic learning. Second, we introduce a Transformer-based compression module to improve semantic compression efficacy. Because directly mining the complex redundancy among heterogeneous features in different coding stages is non-trivial, we introduce a compact blueprint semantic representation to align these features into a similar form, fully unleashing the power of the Transformer-based compression module. Extensive results demonstrate that the proposed SMC and SMC++ models show remarkable superiority over previous traditional, learnable, and perceptual-quality-oriented video codecs on three video analysis tasks and seven datasets. Codes and models are available at: https://github.com/tianyuan168326/VideoSemanticCompression-Pytorch.
Problem

Research questions and friction points this paper is trying to address.

Preserving video semantics during compression for analysis tasks
Reducing non-semantic information to optimize bitrate usage
Aligning heterogeneous features for effective semantic compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses masked video modeling for semantic compression
Regularizes non-semantic entropy in token space
Introduces blueprint representation for feature alignment
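The second bullet — regularizing non-semantic entropy in the token space — can be pictured as penalizing whatever share of the compressed bitstream's entropy is not attributable to semantics. The decomposition below is an illustrative sketch under that reading, not the paper's exact formulation; both distributions and the subtraction are hypothetical simplifications.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def nonsemantic_entropy_penalty(token_probs, semantic_probs):
    """Illustrative penalty: total token entropy minus the part
    attributable to semantics. Minimizing it pushes the codec to
    spend bits on semantics rather than trivial texture detail.
    (Hypothetical decomposition, not the paper's formulation.)
    """
    return max(0.0, entropy(token_probs) - entropy(semantic_probs))
```

For example, if the compressed tokens are uniform over four symbols (2 bits) while the semantic content only requires a uniform choice over two (1 bit), the penalty is 1 bit of non-semantic overhead to be suppressed.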