L-STEC: Learned Video Compression with Long-term Spatio-Temporal Enhanced Context

📅 2025-12-14
🤖 AI Summary
Existing conditional neural video compression methods rely on single-frame feature prediction, leading to insufficient long-term temporal modeling and loss of texture details. To address this, we propose a long-range spatiotemporal enhancement framework for context modeling: (1) we introduce LSTM into neural video compression for the first time to explicitly capture long-term temporal dependencies; (2) we design an optical-flow-guided pixel-domain warped fusion mechanism to achieve deformation-aware spatial context alignment; and (3) we integrate multi-receptive-field networks to jointly fuse spatiotemporal features. An end-to-end differentiable codec architecture is built accordingly. Experimental results demonstrate that our method achieves state-of-the-art (SOTA) performance: under PSNR and MS-SSIM metrics, it reduces bitrates by 37.01% and 31.65%, respectively, compared to DCVC-TCM, and consistently outperforms both VTM-17.0 and DCVC-FM.
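The listing links no code. As a hedged illustration of the optical-flow-guided pixel-domain warping the summary describes, the sketch below implements plain backward warping with bilinear sampling in NumPy. The function name and the `(H, W, 2)` flow layout are assumptions for illustration, not the paper's API; the actual L-STEC module operates on learned features inside a neural codec.

```python
import numpy as np

def warp_bilinear(img, flow):
    """Backward-warp a single-channel frame by a dense flow field.

    img:  (H, W) reference frame.
    flow: (H, W, 2) displacements; flow[..., 0] is horizontal (x),
          flow[..., 1] is vertical (y). Layout is an assumption.
    Returns the warped frame, sampled bilinearly with border clamping.
    """
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    # Positions in the reference frame to sample from, clamped to the border.
    x = np.clip(xs + flow[..., 0], 0, W - 1)
    y = np.clip(ys + flow[..., 1], 0, H - 1)
    x0 = np.floor(x).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(y).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx = x - x0; wy = y - y0
    # Bilinear interpolation of the four neighbouring samples.
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

With zero flow the warp is the identity; an integer horizontal flow shifts the frame by whole pixels, which is a convenient sanity check for the sampling convention.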

📝 Abstract
Neural Video Compression has emerged in recent years, with condition-based frameworks outperforming traditional codecs. However, most existing methods rely solely on the previous frame's features to predict temporal context, leading to two critical issues. First, the short reference window misses long-term dependencies and fine texture details. Second, propagating only feature-level information accumulates errors over frames, causing prediction inaccuracies and loss of subtle textures. To address these, we propose the Long-term Spatio-Temporal Enhanced Context (L-STEC) method. We first extend the reference chain with LSTM to capture long-term dependencies. We then incorporate warped spatial context from the pixel domain, fusing spatio-temporal information through a multi-receptive field network to better preserve reference details. Experimental results show that L-STEC significantly improves compression by enriching contextual information, achieving 37.01% bitrate savings in PSNR and 31.65% in MS-SSIM compared to DCVC-TCM, outperforming both VTM-17.0 and DCVC-FM and establishing new state-of-the-art performance.
Problem

Research questions and friction points this paper is trying to address.

Enhances long-term dependencies in video compression
Reduces error accumulation across video frames
Improves preservation of fine texture details
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends reference chain with LSTM for long-term dependencies
Incorporates warped spatial context from pixel domain
Fuses spatio-temporal information via multi-receptive field network
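The bullets above can be grounded with a minimal sketch of the recurrent reference chain: a textbook LSTM update over flattened frame features in NumPy. This is a simplification, not the authors' architecture — L-STEC presumably uses a convolutional variant inside the codec, and the gate ordering and fully-connected form here are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: h, c carry long-term context across frames.

    x: (n,) current frame features; h, c: (d,) hidden and cell state.
    W: (4d, n), U: (4d, d), b: (4d,) with gates ordered [i, f, o, g].
    """
    z = W @ x + U @ h + b
    d = h.shape[0]
    i = sigmoid(z[:d])          # input gate: how much new content enters
    f = sigmoid(z[d:2 * d])     # forget gate: how much old context survives
    o = sigmoid(z[2 * d:3 * d]) # output gate: how much state is exposed
    g = np.tanh(z[3 * d:])      # candidate content from the current frame
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

Because the cell state `c` is carried across every step, information from early frames can persist far beyond the single previous frame a conventional conditional codec conditions on — which is the long-term dependency the Innovation bullets refer to.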
Tiange Zhang
Shenzhen Graduate School, Peking University, Shenzhen, China
Zhimeng Huang
Peking University
Xiandong Meng
University of California, Davis
Kai Zhang
Bytedance Inc, San Diego, USA
Zhipin Deng
Bytedance Inc, San Diego, USA
Siwei Ma
Shenzhen Graduate School, Peking University, Shenzhen, China