🤖 AI Summary
Existing implicit neural representation (INR)-based video compression methods struggle to model detail-rich and highly dynamic content, primarily due to insufficient feature utilization and the lack of explicit modeling of video temporal structure. To address this, we propose MSNeRV, a multi-scale neural video compression framework. MSNeRV introduces two key innovations: (1) GoP-level grid-based background modeling coupled with multi-scale implicit feature reuse, jointly exploiting spatiotemporal dependencies, multi-frequency information, and multi-resolution decoding; and (2) GoP-level temporal window encoding and a scale-adaptive loss function to enhance dynamic representation capability. Evaluated on HEVC Class B and UVG datasets, MSNeRV consistently outperforms state-of-the-art INR-based methods. Notably, on dynamic scenes, it achieves higher PSNR than VTM-23.7 (RA configuration) and delivers superior rate-distortion performance.
📝 Abstract
Implicit neural representations (INRs) have emerged as a promising approach for video compression, achieving performance comparable to state-of-the-art codecs such as H.266/VVC. However, existing INR-based methods struggle to effectively represent detail-intensive and fast-changing video content. This limitation stems mainly from the underutilization of internal network features and the absence of video-specific considerations in network design. To address these challenges, we propose a multi-scale feature fusion framework, MSNeRV, for neural video representation. In the encoding stage, we enhance temporal consistency by employing temporal windows, and we divide the video into multiple Groups of Pictures (GoPs), where a GoP-level grid is used for background representation. Additionally, we design a multi-scale spatial decoder with a scale-adaptive loss function to integrate multi-resolution and multi-frequency information. To further improve feature extraction, we introduce a multi-scale feature block that fully leverages hidden features. We evaluate MSNeRV on the HEVC Class B and UVG datasets for video representation and compression. Experimental results demonstrate that our model exhibits superior representation capability among INR-based approaches and surpasses VTM-23.7 (Random Access) in dynamic scenarios in terms of compression efficiency.
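To make the "scale-adaptive loss" idea concrete, here is a minimal, hypothetical sketch: the decoder is assumed to emit reconstructions at several resolutions, and each scale's reconstruction error is weighted before summing. The function names, the use of MSE, and the weighting scheme are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch of a scale-adaptive reconstruction loss: weight the
# per-scale errors of a multi-resolution decoder and sum them. All names
# and the MSE/weighting choices are assumptions for illustration only.

def mse(pred, target):
    """Mean squared error between two equal-length lists of pixel values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def scale_adaptive_loss(preds_by_scale, targets_by_scale, weights):
    """Weighted sum of per-scale reconstruction errors.

    preds_by_scale / targets_by_scale: one flattened image per decoder
    resolution (coarse to fine); weights: one scalar per scale.
    """
    return sum(w * mse(p, t)
               for w, p, t in zip(weights, preds_by_scale, targets_by_scale))

# Toy usage: two scales, with the finer scale weighted more heavily.
coarse_pred, coarse_tgt = [0.5, 0.5], [0.4, 0.6]
fine_pred, fine_tgt = [0.1, 0.2, 0.3, 0.4], [0.1, 0.25, 0.3, 0.35]
loss = scale_adaptive_loss([coarse_pred, fine_pred],
                           [coarse_tgt, fine_tgt],
                           weights=[0.5, 1.0])
```

In an actual training loop the weights could themselves be scheduled or learned per GoP, which is one plausible reading of "scale-adaptive".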