🤖 AI Summary
This work addresses low-latency causal video super-resolution (VSR) under H.265 compression for video conferencing, targeting perceptual quality enhancement for three types of low-resolution (LR) video: generic videos, portrait close-ups, and screen content. To this end, we propose a causal temporal propagation network, a lightweight spatio-temporal feature alignment mechanism, and an H.265 distortion modeling approach. We further introduce the first open-source, conference-oriented screen-content VSR dataset. Subjective evaluation is crowdsourced and follows ITU-T Rec. P.910. Experiments demonstrate that our method, along with the top-performing competition solutions, significantly outperforms bilinear interpolation and non-causal VSR baselines under strict low-latency constraints. The results advance the practical deployment of real-time VSR under realistic codec conditions.
📝 Abstract
Super-Resolution (SR) is a critical task in computer vision that focuses on reconstructing high-resolution (HR) images from low-resolution (LR) inputs. The field has seen significant progress through various challenges, particularly in single-image SR. Video Super-Resolution (VSR) extends this to the temporal domain, aiming to enhance video quality using methods such as local, uni-directional, or bi-directional propagation, or traditional upscaling followed by restoration. This challenge addresses VSR for conferencing, where LR videos are encoded with H.265 at fixed quantization parameters (QPs). The goal is to upscale videos by a specific factor, providing HR outputs with enhanced perceptual quality in a low-delay scenario using causal models. The challenge included three tracks: general-purpose videos, talking-head videos, and screen content videos, with separate datasets provided by the organizers for training, validation, and testing. We open-sourced a new screen content dataset for the SR task in this challenge. Submissions were evaluated through subjective tests using a crowdsourced implementation of ITU-T Rec. P.910.
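To make the causal, low-delay constraint concrete, the following is a minimal sketch and not any submission's method. It assumes grayscale frames stored as nested lists, substitutes nearest-neighbour upscaling for a learned network, and uses a simple running average as a stand-in for propagated features. The names `upscale_nearest`, `causal_vsr`, `factor`, and `alpha` are hypothetical. The point it illustrates is that each output frame depends only on the current and earlier LR frames, as in uni-directional propagation.

```python
from typing import Iterable, Iterator, List

Frame = List[List[float]]  # one grayscale frame as rows of pixel values


def upscale_nearest(frame: Frame, factor: int) -> Frame:
    """Nearest-neighbour upscaling: repeat each pixel `factor` times on both axes."""
    out: Frame = []
    for row in frame:
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def causal_vsr(frames: Iterable[Frame], factor: int = 2,
               alpha: float = 0.5) -> Iterator[Frame]:
    """Yield one HR frame per LR frame using only current and past frames.

    `state` is a placeholder for the hidden features a real causal model
    would propagate forward in time (an assumption for illustration).
    """
    state = None
    for lr in frames:
        up = upscale_nearest(lr, factor)
        if state is None:
            state = up  # first frame: no history available
        else:
            # Blend the current upscaled frame with the propagated state.
            state = [
                [alpha * u + (1 - alpha) * s for u, s in zip(up_row, st_row)]
                for up_row, st_row in zip(up, state)
            ]
        yield state  # emitted immediately, with no look-ahead
```

Because the generator emits a frame as soon as its LR input arrives, changing a later input frame cannot alter an earlier output. A bi-directional model would violate this property, since it needs future frames before producing any output.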