🤖 AI Summary
This work addresses the limitations of existing video compression methods: conventional codecs struggle to exploit the temporal redundancy of static scenes, neural codecs are sensitive to train-test distribution shifts, and generative approaches often introduce spurious details. To overcome these issues, the authors propose a novel paradigm that models short-term temporal variations as positive-incentive noise within a neural video compression framework. This approach explicitly decouples transient dynamics from the static background, enabling the model to internalize structural priors. During inference, only the invariant background needs to be transmitted, at an extremely low bitrate, achieving a favorable trade-off between fidelity and compression efficiency. By introducing positive-incentive noise into static-scene compression for the first time, the method attains a 73% Bjøntegaard delta rate saving over general neural video compression, thereby supporting robust transmission under poor network conditions and cost-effective long-term storage of surveillance video.
📝 Abstract
Static scene videos, such as surveillance feeds and videotelephony streams, constitute a dominant share of storage consumption and network traffic. However, traditional standardized codecs and neural video compression (NVC) methods struggle to encode these videos efficiently, due to inadequate exploitation of temporal redundancy and to severe distribution gaps between training and test data, respectively. While recent generative compression methods improve perceptual quality, they introduce hallucinated details that are unacceptable in authenticity-critical applications. To overcome these limitations, we propose to incorporate positive-incentive noise into NVC for static scene videos, where short-term temporal changes are reinterpreted as positive-incentive noise to facilitate model finetuning. By disentangling transient variations from the persistent background, structured prior information is internalized in the compression model. During inference, the invariant component requires minimal signaling, thus reducing data transmission while maintaining pixel-level fidelity. Preliminary experiments demonstrate a 73% Bjøntegaard delta (BD) rate saving compared to general NVC models. Our method provides an effective solution to trade computation for bandwidth, enabling robust video transmission under adverse network conditions and economical long-term retention of surveillance footage.
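The core idea, disentangling a persistent background from transient variations that are treated as noise rather than as content to transmit, can be illustrated with a minimal sketch. This is not the paper's actual model; it assumes a toy NumPy decomposition (temporal median as the "invariant component", per-frame residual as the "transient variation") purely to make the static/transient split concrete.

```python
import numpy as np

# Toy clip of a static scene: (T, H, W) frames in [0, 1].
# The scene is a constant background plus small per-frame jitter,
# standing in for short-term temporal changes (illustrative only).
rng = np.random.default_rng(0)
frames = np.clip(0.5 + 0.05 * rng.standard_normal((16, 8, 8)), 0.0, 1.0)

# Persistent background: per-pixel temporal median over the clip.
# In the paper's framing this is the invariant component that alone
# would need signaling at inference time.
background = np.median(frames, axis=0)

# Transient variation: what remains after removing the background.
# The paper reinterprets this residual as positive-incentive noise
# during finetuning, rather than as content to be transmitted.
residual = frames - background

# For a truly static scene the residual carries far less energy than
# the background, which is why dropping it costs little fidelity.
residual_energy = float(np.mean(residual**2))
background_energy = float(np.mean(background**2))
```

Under this toy setup `residual_energy` is orders of magnitude smaller than `background_energy`, which is the intuition behind transmitting only the invariant component at an extremely low bitrate.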