L-STEC: Learned Video Compression with Long-term Spatio-Temporal Enhanced Context

📅 2025-12-14
🤖 AI Summary
Existing conditional neural video compression methods rely on single-frame feature prediction, leading to insufficient long-term temporal modeling and loss of texture details. To address this, we propose a long-range spatiotemporal enhancement framework for context modeling: (1) we introduce LSTM into neural video compression for the first time to explicitly capture long-term temporal dependencies; (2) we design an optical-flow-guided pixel-domain warped fusion mechanism to achieve deformation-aware spatial context alignment; and (3) we integrate multi-receptive-field networks to jointly fuse spatiotemporal features. An end-to-end differentiable codec architecture is built accordingly. Experimental results demonstrate that our method achieves state-of-the-art (SOTA) performance: under PSNR and MS-SSIM metrics, it reduces bitrates by 37.01% and 31.65%, respectively, compared to DCVC-TCM, and consistently outperforms both VTM-17.0 and DCVC-FM.
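The listing links no code. As a hedged illustration of the optical-flow-guided pixel-domain warping the summary describes, the sketch below implements plain backward warping with bilinear sampling in NumPy. The function name and the `(H, W, 2)` flow layout are assumptions for illustration, not the paper's API; the actual L-STEC module operates on learned features inside a neural codec.

```python
import numpy as np

def warp_bilinear(img, flow):
    """Backward-warp a single-channel frame by a dense flow field.

    img:  (H, W) reference frame.
    flow: (H, W, 2) displacements; flow[..., 0] is horizontal (x),
          flow[..., 1] is vertical (y). Layout is an assumption.
    Returns the warped frame, sampled bilinearly with border clamping.
    """
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    # Positions in the reference frame to sample from, clamped to the border.
    x = np.clip(xs + flow[..., 0], 0, W - 1)
    y = np.clip(ys + flow[..., 1], 0, H - 1)
    x0 = np.floor(x).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(y).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx = x - x0; wy = y - y0
    # Bilinear interpolation of the four neighbouring samples.
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

With zero flow the warp is the identity; an integer horizontal flow shifts the frame by whole pixels, which is a convenient sanity check for the sampling convention.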

📝 Abstract
Neural Video Compression has emerged in recent years, with condition-based frameworks outperforming traditional codecs. However, most existing methods rely solely on the previous frame's features to predict temporal context, leading to two critical issues. First, the short reference window misses long-term dependencies and fine texture details. Second, propagating only feature-level information accumulates errors over frames, causing prediction inaccuracies and loss of subtle textures. To address these, we propose the Long-term Spatio-Temporal Enhanced Context (L-STEC) method. We first extend the reference chain with LSTM to capture long-term dependencies. We then incorporate warped spatial context from the pixel domain, fusing spatio-temporal information through a multi-receptive field network to better preserve reference details. Experimental results show that L-STEC significantly improves compression by enriching contextual information, achieving 37.01% bitrate savings in PSNR and 31.65% in MS-SSIM compared to DCVC-TCM, outperforming both VTM-17.0 and DCVC-FM and establishing new state-of-the-art performance.
Problem

Research questions and friction points this paper is trying to address.

Enhances long-term dependencies in video compression
Reduces error accumulation across video frames
Improves preservation of fine texture details
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends reference chain with LSTM for long-term dependencies
Incorporates warped spatial context from pixel domain
Fuses spatio-temporal information via multi-receptive field network
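The bullets above can be grounded with a minimal sketch of the recurrent reference chain: a textbook LSTM update over flattened frame features in NumPy. This is a simplification, not the authors' architecture — L-STEC presumably uses a convolutional variant inside the codec, and the gate ordering and fully-connected form here are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: h, c carry long-term context across frames.

    x: (n,) current frame features; h, c: (d,) hidden and cell state.
    W: (4d, n), U: (4d, d), b: (4d,) with gates ordered [i, f, o, g].
    """
    z = W @ x + U @ h + b
    d = h.shape[0]
    i = sigmoid(z[:d])          # input gate: how much new content enters
    f = sigmoid(z[d:2 * d])     # forget gate: how much old context survives
    o = sigmoid(z[2 * d:3 * d]) # output gate: how much state is exposed
    g = np.tanh(z[3 * d:])      # candidate content from the current frame
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

Because the cell state `c` is carried across every step, information from early frames can persist far beyond the single previous frame a conventional conditional codec conditions on — which is the long-term dependency the Innovation bullets refer to.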
Tiange Zhang
Shenzhen Graduate School, Peking University, Shenzhen, China
Zhimeng Huang
Peking University
Xiandong Meng
University of California, Davis
Kai Zhang
Bytedance Inc, San Diego, USA
Zhipin Deng
Bytedance Inc, San Diego, USA
Siwei Ma
Shenzhen Graduate School, Peking University, Shenzhen, China