🤖 AI Summary
To address the perceptual latency that degrades interactive user experience in real-time video transmission, this paper proposes IFRVP (Intermediate Feature Refinement Video Prediction), a framework for zero-latency video prediction. Methodologically, the authors build on IFRNet, an existing lightweight convolution-only frame interpolation network, introduce ELAN-based residual blocks to improve both accuracy and efficiency, and propose three training methods that extend frame interpolation models to predictive tasks. An intermediate feature refinement mechanism further supports end-to-end prediction modeling. Experimental results show that IFRVP achieves the best trade-off between prediction accuracy and inference speed among existing video prediction methods, enabling real-time prediction at over 30 FPS and substantially reducing end-to-end perceptual latency. The source code and demonstration videos are publicly available.
📝 Abstract
Transmission latency significantly affects users' quality of experience in real-time interaction and actuation. As some latency is fundamentally unavoidable, video prediction can be utilized to mask it and ultimately enable zero-latency transmission. However, most existing video prediction methods are computationally expensive and impractical for real-time applications. In this work, we therefore propose real-time video prediction toward zero-latency interaction over networks, called IFRVP (Intermediate Feature Refinement Video Prediction). First, we propose three training methods for video prediction that extend frame interpolation models, where we utilize a simple convolution-only frame interpolation network based on IFRNet. Second, we introduce ELAN-based residual blocks into the prediction models to improve both inference speed and accuracy. Our evaluations show that our proposed models perform efficiently and achieve the best trade-off between prediction accuracy and computational speed among existing video prediction methods. A demonstration video is provided at http://bit.ly/IFRVPDemo.
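The core idea of repurposing a frame interpolation model as a predictor can be illustrated with a toy sketch: an interpolator that blends between two frames at timestep t in (0, 1) can instead be evaluated at t > 1 to extrapolate a future frame. The sketch below uses a naive linear per-pixel blend purely for illustration; it is an assumption for exposition, not IFRVP's actual network or training procedure.

```python
import numpy as np

def interpolate(frame_a, frame_b, t):
    """Linear per-pixel blend between two frames at timestep t.

    t in (0, 1) yields an in-between frame; t > 1 extrapolates
    past frame_b, i.e. the interpolator acts as a predictor.
    This toy blend stands in for a learned interpolation network.
    """
    return frame_a + t * (frame_b - frame_a)

# Two toy 2x2 grayscale frames with uniform brightness change.
f0 = np.zeros((2, 2))
f1 = np.ones((2, 2))

mid = interpolate(f0, f1, 0.5)   # interpolation: in-between frame
pred = interpolate(f0, f1, 2.0)  # extrapolation: predicted future frame
```

In a real system the blend is replaced by a learned model (here, an IFRNet-style network), and the training schemes determine how the model is supervised to produce frames beyond the last observed one.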