🤖 AI Summary
Existing neural video representation (NVR) methods suffer from two fundamental bottlenecks: limited model capacity and high computational overhead, which hinder both compression performance and deployment flexibility. To address this, we propose an online structural reparameterization framework featuring a lightweight Enhanced Reparameterizable Block (ERB) and an online fusion strategy. During training, multi-branch convolutions dynamically expand model capacity; at inference, the network is equivalently converted into a single-branch structure, preserving expressive power while ensuring efficiency. Our method requires no modification to the backbone architecture and is readily integrable into existing pipelines. Evaluated on mainstream video datasets, it achieves PSNR gains of 0.37–2.7 dB over baselines, with comparable training time and decoding speed. This approach effectively pushes past the conventional capacity–efficiency trade-off in NVR.
📝 Abstract
Neural Video Representation (NVR) is a promising paradigm for video compression, showing great potential for improving video storage and transmission efficiency. While recent advances have refined network architectures to improve representational capability, these methods typically involve complex designs that may incur increased computational overhead and lack the flexibility to integrate into other frameworks. Moreover, the inherent limitation in model capacity restricts the expressiveness of NVR networks, resulting in a performance bottleneck. To overcome these limitations, we propose Online-RepNeRV, an NVR framework based on online structural reparameterization. Specifically, we introduce a universal reparameterization block named ERB, which incorporates multiple parallel convolutional paths to enhance model capacity. To mitigate the overhead, an online reparameterization strategy dynamically fuses the parameters during training, and the multi-branch structure is equivalently converted into a single-branch structure after training. As a result, the additional computational and parameter complexity is confined to the encoding stage and does not affect decoding efficiency. Extensive experiments on mainstream video datasets demonstrate that our method achieves an average PSNR gain of 0.37–2.7 dB over baseline methods, while maintaining comparable training time and decoding speed.
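The equivalence at the heart of structural reparameterization rests on the linearity of convolution: the sum of several parallel convolution branches equals a single convolution with the summed (suitably padded) kernels. The sketch below is a minimal, hypothetical illustration of that identity in NumPy (single channel, valid padding, no BatchNorm); it is not the paper's ERB implementation, which would additionally fold normalization parameters into the fused kernel.

```python
import numpy as np

def conv2d(x, k):
    """Valid-mode 2D cross-correlation of a single-channel image x with kernel k."""
    kh, kw = k.shape
    h, w = x.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))          # toy input "frame"

k3 = rng.standard_normal((3, 3))         # 3x3 branch
k1 = rng.standard_normal((1, 1))         # 1x1 branch
kid = np.zeros((3, 3)); kid[1, 1] = 1.0  # identity branch as a delta kernel

# Embed the 1x1 kernel at the center of a 3x3 kernel so all branches share one shape.
k1_pad = np.zeros((3, 3)); k1_pad[1, 1] = k1[0, 0]

multi = conv2d(x, k3) + conv2d(x, k1_pad) + conv2d(x, kid)  # training-time multi-branch
fused = conv2d(x, k3 + k1_pad + kid)                        # inference-time single branch

print(np.allclose(multi, fused))  # True: the fused kernel reproduces the branches exactly
```

Because the fusion is exact, the multi-branch capacity is only paid for during encoding (training); the decoder sees a plain single-branch convolution.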