🤖 AI Summary
Existing neural video codecs (NVCs) rely solely on temporal references, leading to insufficient context modeling and misaligned latent priors, which particularly degrades performance under large motions or newly emerging objects. To address these limitations, we propose the Spatially Embedded Video Codec (SEVC), the first NVC to incorporate low-resolution spatial references alongside temporal ones, enabling joint spatial-temporal reference utilization for improved motion vector estimation and hybrid context modeling. SEVC further introduces a spatially guided latent prior, augmented by multiple temporal latent representations, to mitigate prior misalignment, and it jointly optimizes the rate-distortion trade-off with quality-adaptive bit allocation for the spatial references. Notably, SEVC produces two bitstreams: a primary stream and an auxiliary low-resolution stream, offering additional flexibility. Experimental results show that SEVC achieves an average 11.9% bitrate reduction over the previous state-of-the-art NVC while markedly improving reconstruction quality in challenging large-motion and emerging-object scenarios.
📝 Abstract
Most Neural Video Codecs (NVCs) employ only temporal references, generating temporal-only contexts and latent priors. Such temporal-only NVCs fail to handle large motions or emerging objects because of their limited contexts and misaligned latent priors. To address these limitations, we propose the Spatially Embedded Video Codec (SEVC), in which a low-resolution version of the video is compressed to provide spatial references. First, SEVC leverages both spatial and temporal references to generate augmented motion vectors and hybrid spatial-temporal contexts. Second, to resolve the misalignment in the latent prior and enrich its information, we introduce a spatial-guided latent prior augmented by multiple temporal latent representations. Finally, we design a joint spatial-temporal optimization that learns quality-adaptive bit allocation for the spatial references, further boosting rate-distortion performance. Experimental results show that SEVC effectively alleviates the limitations of temporal-only codecs in handling large motions or emerging objects, and reduces bitrate by 11.9% compared with the previous state-of-the-art NVC while additionally providing a low-resolution bitstream. Our code and model are available at https://github.com/EsakaK/SEVC.
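To make the idea of hybrid spatial-temporal contexts concrete, here is a minimal, purely illustrative NumPy sketch. It is not the paper's implementation: the function names are hypothetical, nearest-neighbour upsampling stands in for a learned upsampler, and simple averaging stands in for the learned fusion network; the actual SEVC modules are neural and motion-compensated.

```python
import numpy as np

def build_hybrid_context(prev_frame, cur_lowres_rec, scale=4):
    """Illustrative fusion of a temporal reference (the previously
    reconstructed frame) with a spatial reference (an upsampled
    low-resolution reconstruction of the current frame).

    Hypothetical stand-in for SEVC's learned context modules.
    """
    # Upsample the low-resolution spatial reference to full resolution
    # (nearest-neighbour; SEVC would use a learned network).
    spatial_ref = np.repeat(np.repeat(cur_lowres_rec, scale, axis=0),
                            scale, axis=1)
    # In SEVC the temporal reference is warped with decoded motion
    # vectors; here we take the previous frame as-is.
    temporal_ref = prev_frame
    # Fuse both references into a single context map
    # (plain averaging here; a learned fusion in the paper).
    return 0.5 * spatial_ref + 0.5 * temporal_ref
```

Even this toy version shows why a spatial reference helps: for an object absent from `prev_frame`, the temporal reference carries no signal, while the low-resolution reconstruction of the current frame still provides a coarse spatial hint for the context.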