🤖 AI Summary
This work addresses the degradation of visual perceptual quality caused by compression artifacts introduced by standard video codecs. To mitigate this issue, the authors propose a lightweight, codec-agnostic, semantic-aware preprocessing framework that integrates semantic embeddings from vision-language models with a differentiable codec proxy. The framework employs an efficient convolutional network to enable end-to-end training, selectively enhancing detail fidelity in perceptually salient regions without altering the existing encoding pipeline. Experimental results demonstrate that the proposed method significantly outperforms baseline approaches on high-resolution benchmarks, achieving consistent improvements in both MS-SSIM and VMAF metrics while effectively preserving texture details in visually critical areas.
📝 Abstract
Compression artifacts from standard video codecs often degrade perceptual quality. We propose a lightweight, semantic-aware pre-processing framework that enhances perceptual fidelity by selectively addressing these distortions. Our method integrates semantic embeddings from a vision-language model into an efficient convolutional architecture, prioritizing the preservation of perceptually significant structures. The model is trained end-to-end with a differentiable codec proxy, enabling it to mitigate artifacts from various standard codecs without modifying the existing video pipeline. During inference, the codec proxy is discarded, and SCENE operates as a standalone pre-processor, enabling real-time performance. Experiments on high-resolution benchmarks show improved performance over baselines in both objective (MS-SSIM) and perceptual (VMAF) metrics, with notable gains in preserving detailed textures within salient regions. Our results show that semantic-guided, codec-aware pre-processing is an effective approach for enhancing compressed video streams.