🤖 AI Summary
Existing implicit neural representation (INR)-based video compression methods struggle to model detail-rich and highly dynamic content, primarily due to insufficient feature utilization and the lack of explicit modeling of video temporal structure. To address this, we propose MSNeRV, a multi-scale neural video compression framework. MSNeRV introduces two key innovations: (1) GoP-level grid-based background modeling coupled with multi-scale implicit feature reuse, jointly exploiting spatiotemporal dependencies, multi-frequency information, and multi-resolution decoding; and (2) GoP-level temporal window encoding and a scale-adaptive loss function to enhance dynamic representation capability. Evaluated on HEVC Class B and UVG datasets, MSNeRV consistently outperforms state-of-the-art INR-based methods. Notably, on dynamic scenes, it achieves higher PSNR than VTM-23.7 (RA configuration) and delivers superior rate-distortion performance.
📝 Abstract
Implicit neural representations (INRs) have emerged as a promising approach for video compression, achieving performance comparable to state-of-the-art codecs such as H.266/VVC. However, existing INR-based methods struggle to effectively represent detail-intensive and fast-changing video content. This limitation stems mainly from the underutilization of internal network features and the absence of video-specific considerations in network design. To address these challenges, we propose a multi-scale feature fusion framework, MSNeRV, for neural video representation. In the encoding stage, we enhance temporal consistency by employing temporal windows, and we divide the video into multiple Groups of Pictures (GoPs), where a GoP-level grid is used for background representation. Additionally, we design a multi-scale spatial decoder with a scale-adaptive loss function to integrate multi-resolution and multi-frequency information. To further improve feature extraction, we introduce a multi-scale feature block that fully leverages hidden features. We evaluate MSNeRV on the HEVC Class B and UVG datasets for video representation and compression. Experimental results demonstrate that our model exhibits superior representation capability among INR-based approaches and surpasses VTM-23.7 (Random Access) in dynamic scenarios in terms of compression efficiency.
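To make the "scale-adaptive loss" idea concrete, here is a minimal, hypothetical sketch: the decoder is assumed to emit reconstructions at several resolutions, and each scale's reconstruction error is weighted before summing. The function names, the use of MSE, and the weighting scheme are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch of a scale-adaptive reconstruction loss: weight the
# per-scale errors of a multi-resolution decoder and sum them. All names
# and the MSE/weighting choices are assumptions for illustration only.

def mse(pred, target):
    """Mean squared error between two equal-length lists of pixel values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def scale_adaptive_loss(preds_by_scale, targets_by_scale, weights):
    """Weighted sum of per-scale reconstruction errors.

    preds_by_scale / targets_by_scale: one flattened image per decoder
    resolution (coarse to fine); weights: one scalar per scale.
    """
    return sum(w * mse(p, t)
               for w, p, t in zip(weights, preds_by_scale, targets_by_scale))

# Toy usage: two scales, with the finer scale weighted more heavily.
coarse_pred, coarse_tgt = [0.5, 0.5], [0.4, 0.6]
fine_pred, fine_tgt = [0.1, 0.2, 0.3, 0.4], [0.1, 0.25, 0.3, 0.35]
loss = scale_adaptive_loss([coarse_pred, fine_pred],
                           [coarse_tgt, fine_tgt],
                           weights=[0.5, 1.0])
```

In an actual training loop the weights could themselves be scheduled or learned per GoP, which is one plausible reading of "scale-adaptive".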