🤖 AI Summary
This work addresses the limitation of existing implicit neural video representations in effectively leveraging the complementary strengths of neural networks and learnable grids for modeling structured content versus fine-grained details. The authors propose a hybrid framework that, for the first time, explicitly decouples video information into regular and irregular components from a representational perspective. Specifically, they introduce a coupled WarpRNN module to explicitly model structured elements such as motion and geometry, while employing a hybrid residual grid to jointly represent irregular appearance and motion details. This synergistic architecture enables multi-scale motion compensation and efficient implicit representation. Evaluated on the UVG dataset, the method achieves an average PSNR of 33.73 dB with only 3M parameters, outperforming current approaches in reconstruction quality and demonstrating superior generalization across multiple downstream tasks.
📝 Abstract
Implicit Neural Video Representation (INVR) has emerged as a novel approach to video representation and compression that uses learnable grids and neural networks. Existing methods focus on developing new grid structures that are efficient for latent representation and neural network architectures with large representation capacity, but lack a study of their respective roles in video representation. In this paper, the difference between neural-network-based INVR and grid-based INVR is first investigated from the perspective of video information composition to identify their respective advantages: neural networks suit general structure, while grids suit specific detail. Accordingly, an INVR framework based on a mixed neural network and residual grid is proposed, where the neural network represents the regular, structured information and the residual grid represents the remaining irregular information in a video. A Coupled WarpRNN-based multi-scale motion representation and compensation module is designed to explicitly represent the regular, structured information; we therefore term our method CWRNN-INVR. For the irregular information, a mixed residual grid is learned in which the irregular appearance and motion information are represented together. The mixed residual grid can be combined with the coupled WarpRNN in a way that allows network reuse. Experiments show that our method achieves the best reconstruction results among existing methods, with an average PSNR of 33.73 dB on the UVG dataset under the 3M-parameter setting, and outperforms existing INVR methods on other downstream tasks. The code is available at https://github.com/yiyang-sdu/CWRNN-INVR.git.
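The core decomposition the abstract describes — a regular part recovered by motion compensation (warping) plus an irregular part stored in a residual grid — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: `bilinear_warp`, `reconstruct`, and the single-scale flow are illustrative stand-ins for the paper's coupled WarpRNN, which predicts and compensates motion at multiple scales.

```python
import numpy as np

def bilinear_warp(frame, flow):
    """Warp a frame by a dense flow field using bilinear sampling.

    frame: (H, W) array; flow: (H, W, 2) array of (dy, dx) offsets.
    This is the standard motion-compensation primitive; in the paper,
    multi-scale flows would come from the coupled WarpRNN.
    """
    H, W = frame.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    sy = np.clip(ys + flow[..., 0], 0, H - 1)  # sampling coordinates,
    sx = np.clip(xs + flow[..., 1], 0, W - 1)  # clamped at the border
    y0, x0 = np.floor(sy).astype(int), np.floor(sx).astype(int)
    y1, x1 = np.clip(y0 + 1, 0, H - 1), np.clip(x0 + 1, 0, W - 1)
    wy, wx = sy - y0, sx - x0  # fractional parts for interpolation
    return ((1 - wy) * (1 - wx) * frame[y0, x0]
            + (1 - wy) * wx * frame[y0, x1]
            + wy * (1 - wx) * frame[y1, x0]
            + wy * wx * frame[y1, x1])

def reconstruct(prev_frame, flow, residual_grid):
    """Regular part (warped prediction) plus irregular part (residual grid)."""
    return bilinear_warp(prev_frame, flow) + residual_grid
```

In an actual INVR, both the flow and the residual grid would be learnable and optimized jointly against the target frames; here they are plain arrays to keep the decomposition explicit.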