🤖 AI Summary
Existing neural video coding methods employ uniform temporal sampling, which struggles to model temporal redundancy effectively, thereby limiting rate-distortion performance. To address this, we propose a tree-structured implicit representation framework: (i) we introduce a binary search tree (BST) to hierarchically organize temporal features, enabling adaptive non-uniform sampling; (ii) we design a motion-complexity-driven dynamic sampling strategy; and (iii) we incorporate gradient-guided temporal importance assessment with differentiable optimization. Built upon the NeRV architecture, our method significantly improves compression efficiency and reconstruction quality. Extensive experiments demonstrate average PSNR gains of 1.2–2.8 dB and bitrate reductions of 37%–52% across multiple benchmarks, consistently outperforming state-of-the-art uniformly sampled neural video codecs.
📝 Abstract
Implicit Neural Representations for Videos (NeRV) have emerged as a powerful paradigm for video representation, enabling direct mappings from frame indices to video frames. However, existing NeRV-based methods do not fully exploit temporal redundancy, as they rely on uniform sampling along the temporal axis, leading to suboptimal rate-distortion (RD) performance. To address this limitation, we propose Tree-NeRV, a novel tree-structured feature representation for efficient and adaptive video encoding. Unlike conventional approaches, Tree-NeRV organizes feature representations within a Binary Search Tree (BST), enabling non-uniform sampling along the temporal axis. Additionally, we introduce an optimization-driven sampling strategy, dynamically allocating higher sampling density to regions with greater temporal variation. Extensive experiments demonstrate that Tree-NeRV achieves superior compression efficiency and reconstruction quality, outperforming prior uniform sampling-based methods. Code will be released.