🤖 AI Summary
To address the challenge of achieving both precise quality control and energy efficiency in AV1 encoding for virtual production, this paper proposes LiteVPNet, a lightweight neural network that, for the first time, integrates CLIP-based semantic embeddings, bitstream features, and video complexity metrics to predict, end to end, the NVENC AV1 quantization parameters needed to reach a target visual quality (VMAF). Because it relies on low-complexity features, the method keeps computational overhead low while closely matching the target: the mean VMAF error is below 1.2 points, and the error is within 2 points for over 87% of test samples, compared with approximately 61% for prior state-of-the-art approaches, with consistent accuracy across a wide range of quality levels. The core innovation lies in incorporating multimodal semantic information into the classical rate-distortion optimization framework, in support of real-time, high-fidelity, and power-efficient on-set video encoding.
📝 Abstract
In the last decade, video workflows in the cinema production ecosystem have presented new use cases for video streaming technology. These new workflows, e.g. in On-set Virtual Production, require both precise quality control and energy efficiency. Existing approaches to transcoding often fall short of these requirements, due either to a lack of quality control or to high computational overhead. To fill this gap, we present a lightweight neural network (LiteVPNet) that accurately predicts the Quantisation Parameters an NVENC AV1 encoder needs to achieve a specified VMAF score. We use low-complexity features, including bitstream characteristics, video complexity measures, and CLIP-based semantic embeddings. Our results demonstrate that LiteVPNet achieves mean VMAF errors below 1.2 points across a wide range of quality targets. Notably, LiteVPNet achieves VMAF errors within 2 points for over 87% of our test corpus, compared with approximately 61% for state-of-the-art methods. LiteVPNet's performance across various quality regions highlights its applicability to high-value content transport and streaming, enabling more energy-efficient, high-quality media experiences.
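The two headline figures above (mean VMAF error, and the share of clips whose error is within 2 points) can be illustrated with a minimal sketch. It assumes the error is the absolute difference between the target VMAF and the VMAF measured after encoding at the predicted QP. The function name `vmaf_error_stats` and the sample numbers are hypothetical and are not taken from the paper.

```python
from statistics import mean


def vmaf_error_stats(target_vmaf, achieved_vmaf, tolerance=2.0):
    """Return (mean absolute VMAF error, fraction of samples within tolerance).

    Assumption: error = |target VMAF - VMAF achieved at the predicted QP|.
    """
    if len(target_vmaf) != len(achieved_vmaf) or not target_vmaf:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(t - a) for t, a in zip(target_vmaf, achieved_vmaf)]
    within = sum(e <= tolerance for e in errors) / len(errors)
    return mean(errors), within


# Made-up illustrative values, not results from the paper.
targets = [95.0, 90.0, 85.0, 80.0]
achieved = [94.0, 92.5, 85.5, 80.0]
mae, frac = vmaf_error_stats(targets, achieved)
print(f"mean |error| = {mae:.2f} VMAF, within 2 points: {frac:.0%}")
```

Reporting both numbers is useful because a low mean error can hide a tail of badly missed targets, which the within-tolerance fraction exposes.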