🤖 AI Summary
Traditional video super-resolution (VSR) methods rely on explicit inter-frame motion alignment, model space and time separately, and are sensitive to motion estimation errors. To address these limitations, this paper proposes a continuous space-time VSR formulation that models video as a 3D Video Fourier Field (VFF), a joint, continuous implicit representation of space and time. The VFF needs no explicit optical flow or frame warping, and it supports sampling at arbitrary spatiotemporal coordinates with aliasing-free reconstruction. A neural encoder with a large spatiotemporal receptive field, conditioned on the low-resolution input video, predicts the coefficients of a Fourier-like sinusoidal basis. An analytical Gaussian point spread function is included in the sampling to suppress aliasing at arbitrary scale. The paper reports state-of-the-art results on multiple benchmarks: across a wide range of upscaling factors, reconstructions are sharper and temporally more consistent than those of existing baselines, at lower computational cost.
📝 Abstract
We introduce a novel formulation for continuous space-time video super-resolution. Instead of decoupling the representation of a video sequence into separate spatial and temporal components and relying on brittle, explicit frame warping for motion compensation, we encode video as a continuous, spatio-temporally coherent 3D Video Fourier Field (VFF). This representation offers three key advantages: (1) it enables cheap, flexible sampling at arbitrary locations in space and time; (2) it simultaneously captures fine spatial detail and smooth temporal dynamics; and (3) it allows an analytical Gaussian point spread function to be included in the sampling, ensuring aliasing-free reconstruction at arbitrary scale. The coefficients of the proposed Fourier-like sinusoidal basis are predicted by a neural encoder with a large spatio-temporal receptive field, conditioned on the low-resolution input video. Through extensive experiments, we show that our joint modeling substantially improves both spatial and temporal super-resolution and sets a new state of the art on multiple benchmarks: across a wide range of upscaling factors, it delivers sharper and temporally more consistent reconstructions than existing baselines, while being computationally more efficient. Project page: https://v3vsr.github.io.
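The sampling idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `sample_fourier_field`, its parameters, and the use of one shared set of coefficients (rather than coefficients predicted by a neural encoder from the low-resolution video) are all assumptions made for illustration. The sketch relies only on a standard identity: convolving a sinusoid of angular frequency ω with a Gaussian of standard deviation σ leaves the sinusoid unchanged except for an amplitude factor exp(-σ²ω²/2), so a Gaussian point spread function can be applied analytically at any continuous query point.

```python
import numpy as np

def sample_fourier_field(coords, freqs, amps, phases, psf_sigma=None):
    """Evaluate sum_k amps[k] * sin(freqs[k] . p + phases[k]) at points p.

    Hypothetical helper for illustration only.
    coords:    (N, 3) continuous query points (x, y, t)
    freqs:     (K, 3) angular frequencies per basis function
    amps:      (K,)   amplitudes (a real model would predict these)
    phases:    (K,)   phase offsets
    psf_sigma: optional (3,) Gaussian PSF standard deviation per axis
    """
    coords = np.asarray(coords, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    amps = np.asarray(amps, dtype=float)
    phases = np.asarray(phases, dtype=float)

    if psf_sigma is not None:
        sigma = np.asarray(psf_sigma, dtype=float)
        # Gaussian blur of a sinusoid only rescales its amplitude:
        # factor = exp(-0.5 * sum_i (sigma_i * w_i)^2)
        amps = amps * np.exp(-0.5 * np.sum((freqs * sigma) ** 2, axis=1))

    arg = coords @ freqs.T + phases   # (N, K) sinusoid arguments
    return np.sin(arg) @ amps         # (N,) field values

# Usage: random coefficients stand in for encoder outputs (an assumption).
rng = np.random.default_rng(0)
freqs = rng.normal(scale=4.0, size=(16, 3))
amps = rng.normal(size=16)
phases = rng.uniform(0.0, 2.0 * np.pi, size=16)

# Query two arbitrary off-grid space-time locations.
queries = np.array([[0.25, 0.40, 0.10], [0.26, 0.40, 0.55]])
sharp = sample_fourier_field(queries, freqs, amps, phases)
# A wider PSF attenuates high frequencies, as needed for coarser output grids.
smooth = sample_fourier_field(queries, freqs, amps, phases,
                              psf_sigma=[0.2, 0.2, 0.1])
```

Because the attenuation is a closed-form factor per basis function, changing the output scale only changes `psf_sigma`; no resampling or explicit low-pass filtering of the field is required in this sketch.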