Sliding Window Attention for Learned Video Compression

📅 2025-10-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional transformer-based video compression suffers from irregular receptive fields due to non-overlapping block partitioning, and from redundant computation caused by window overlap. Method: This paper proposes a 3D sliding-window attention mechanism enabling block-free, uniformly covered spatiotemporal local modeling. It adopts a decoder-only architecture integrating temporal autoregressive modeling and a lightweight entropy model, preserving long-range contextual modeling while significantly improving inference efficiency. Contribution/Results: By replacing fixed blocks with sliding windows, the method eliminates overlap-induced redundancy, unifies the decoder structure, and ensures consistent receptive fields across all positions. Experiments show that, compared to the VCT baseline, the approach achieves an 18.6% BD-rate reduction, a 2.8× decrease in decoding complexity, and a 3.5× increase in entropy coding throughput, jointly optimizing rate-distortion performance and computational efficiency.

📝 Abstract
To manage the complexity of transformers in video compression, local attention mechanisms are a practical necessity. The common approach of partitioning frames into patches, however, creates architectural flaws like irregular receptive fields. When adapted for temporal autoregressive models, this paradigm, exemplified by the Video Compression Transformer (VCT), also necessitates computationally redundant overlapping windows. This work introduces 3D Sliding Window Attention (SWA), a patchless form of local attention. By enabling a decoder-only architecture that unifies spatial and temporal context processing, and by providing a uniform receptive field, our method significantly improves rate-distortion performance, achieving Bjøntegaard Delta-rate savings of up to 18.6% against the VCT baseline. Simultaneously, by eliminating the need for overlapping windows, our method reduces overall decoder complexity by a factor of 2.8, while its entropy model is nearly 3.5 times more efficient. We further analyze our model's behavior and show that while it benefits from long-range temporal context, excessive context can degrade performance.
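The core idea described in the abstract, patchless local attention in which every spatiotemporal position attends to a fixed-radius neighborhood around itself, can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the function name, the identity Q/K/V projections, the window radii, and the non-causal masking are all assumptions made for brevity (the paper's temporal autoregressive decoder would additionally mask future frames). What it shows is the property the paper emphasizes: unlike block partitioning, the window slides with the query, so every position has the same receptive-field shape.

```python
import numpy as np

def sliding_window_attention_3d(x, radius=(1, 1, 1)):
    """Naive 3D sliding-window (neighborhood) attention sketch.

    x: array of shape (T, H, W, C), tokens on a spatiotemporal grid.
    radius: per-axis window radii (rt, rh, rw); each query attends only
    to keys within this neighborhood, so the receptive field is uniform
    across all positions (no block partitioning, no overlap handling).
    """
    T, H, W, C = x.shape
    # Identity projections keep the sketch minimal; a real model would
    # apply learned linear layers here.
    q = x.reshape(-1, C)
    k = x.reshape(-1, C)
    v = x.reshape(-1, C)

    # Grid coordinates (t, h, w) of every token.
    coords = np.stack(
        np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij"),
        axis=-1,
    ).reshape(-1, 3)

    # Key j is visible to query i iff it lies inside the sliding window.
    diff = np.abs(coords[:, None, :] - coords[None, :, :])  # (N, N, 3)
    mask = np.all(diff <= np.array(radius), axis=-1)        # (N, N)

    # Standard scaled dot-product attention, restricted by the mask.
    scores = q @ k.T / np.sqrt(C)
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ v).reshape(T, H, W, C)
```

With `radius=(0, 0, 0)` each token attends only to itself and the output equals the input, which is a convenient sanity check; larger radii mix information from the surrounding spatiotemporal neighborhood while keeping cost linear in the number of positions for a fixed window size.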
Problem

Research questions and friction points this paper is trying to address.

Improving video compression rate-distortion performance
Reducing decoder complexity in transformer architectures
Eliminating redundant overlapping windows in temporal models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces 3D Sliding Window Attention for video compression
Unifies spatial and temporal context in decoder-only architecture
Eliminates overlapping windows to reduce computational complexity
Alexander Kopte
Multimedia Communications and Signal Processing, Friedrich-Alexander University Erlangen-Nuremberg, Erlangen, Germany
André Kaup
Professor, Friedrich-Alexander University Erlangen-Nuremberg
Image and Video Coding · Multimedia Signal Processing