Context Guided Transformer Entropy Modeling for Video Compression

📅 2025-08-03
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
In video compression, existing conditional entropy models face two key challenges: (1) incorporating temporal context significantly increases computational overhead; and (2) spatial context modeling lacks explicit dependency ordering, limiting context availability during decoding. This paper proposes a Context-Guided Transformer Entropy Model that jointly addresses both issues. It introduces temporal context resampling to reduce redundancy and dependency-weighted spatial context modeling to explicitly encode spatial dependency order. Furthermore, the authors design an attention-guided teacher–student network and a top-k dependency selection mechanism to enhance contextual prioritization and efficiency. Experiments demonstrate that the model reduces entropy modeling latency by approximately 65% compared to state-of-the-art methods, achieves an 11% BD-Rate improvement, and maintains high compression performance while substantially improving inference efficiency and decoder friendliness.

๐Ÿ“ Abstract
Conditional entropy models effectively leverage spatio-temporal contexts to reduce video redundancy. However, incorporating temporal context often introduces additional model complexity and increases computational cost. In parallel, many existing spatial context models lack explicit modeling of the ordering of spatial dependencies, which may limit the availability of relevant context during decoding. To address these issues, we propose the Context Guided Transformer (CGT) entropy model, which estimates probability mass functions of the current frame conditioned on resampled temporal context and dependency-weighted spatial context. A temporal context resampler learns predefined latent queries to extract critical temporal information using transformer encoders, reducing downstream computational overhead. Meanwhile, a teacher-student network is designed as a dependency-weighted spatial context assigner to explicitly model the ordering of spatial context dependencies. The teacher generates an attention map to represent token importance and an entropy map to reflect prediction certainty from randomly masked inputs, guiding the student to select the weighted top-k tokens with the highest spatial dependency. During inference, only the student is used to predict undecoded tokens based on high-dependency context. Experimental results demonstrate that our CGT model reduces entropy modeling time by approximately 65% and achieves an 11% BD-Rate reduction compared to the previous state-of-the-art conditional entropy model.
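The weighted top-k selection described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the fusion rule (attention weight scaled by certainty, taken here as the exponential of the negative entropy) and the function name `select_topk_context` are assumptions for demonstration.

```python
import numpy as np

def select_topk_context(attention, entropy, k):
    """Hypothetical weighted top-k selection: fuse the teacher's attention
    map (token importance) with its prediction certainty (low entropy) and
    keep the k tokens with the highest spatial-dependency score."""
    certainty = np.exp(-entropy)       # high where the teacher is confident
    score = attention * certainty      # assumed fusion rule, not from the paper
    idx = np.argsort(score)[::-1][:k]  # indices of the k highest-scoring tokens
    return np.sort(idx)                # return in raster order for decoding

# Toy example: 5 spatial tokens, keep the 2 with highest dependency.
attention = np.array([0.10, 0.40, 0.05, 0.30, 0.15])
entropy   = np.array([2.00, 0.50, 1.00, 0.20, 1.50])
print(select_topk_context(attention, entropy, k=2))  # -> [1 3]
```

Tokens 1 and 3 win because they combine high attention with low entropy; token 0's moderate attention is discounted by its high uncertainty.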
Problem

Research questions and friction points this paper is trying to address.

Reducing video redundancy with efficient entropy modeling
Minimizing computational cost in temporal context integration
Improving spatial context modeling with dependency weighting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based temporal context resampling for efficiency
Teacher-student network for spatial dependency weighting
Top-k token selection guided by attention and entropy
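The temporal-context resampling idea, a small set of learned latent queries cross-attending to a long temporal sequence to compress it to a fixed token budget, can be sketched as single-head cross-attention. All names, dimensions, and the query count here are illustrative assumptions; the paper's resampler uses transformer encoders with learned queries, of which this shows only the core attention step.

```python
import numpy as np

def resample_temporal_context(context, queries):
    """Hypothetical latent-query resampler: m learned queries attend over
    n temporal-context tokens, returning m compressed tokens so downstream
    entropy modeling cost no longer scales with the context length."""
    d = queries.shape[-1]
    logits = queries @ context.T / np.sqrt(d)           # (m, n) scaled dot products
    logits -= logits.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over context tokens
    return weights @ context                            # (m, d) resampled context

rng = np.random.default_rng(0)
context = rng.normal(size=(64, 16))  # 64 temporal-context tokens, dim 16 (assumed)
queries = rng.normal(size=(8, 16))   # 8 learned latent queries (assumed count)
print(resample_temporal_context(context, queries).shape)  # -> (8, 16)
```

The point of the design: downstream modules attend over 8 tokens instead of 64, which is where the reported reduction in entropy-modeling overhead comes from.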
Junlong Tong
Shanghai Jiao Tong University
Wei Zhang
Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, EIT
Yaohui Jin
Shanghai Jiao Tong University
Xiaoyu Shen
Eastern Institute of Technology, Ningbo
language model · multi-modal learning · reasoning