🤖 AI Summary
Video super-resolution (VSR) faces the dual challenge of modeling non-local spatiotemporal dependencies across frames while maintaining computational efficiency, particularly under large motion displacements and long video sequences, where optical-flow-based methods and Transformers exhibit limitations. This paper proposes MambaVSR, the first SSM-based VSR framework. Its core contributions are: (1) Shared Compass Construction (SCC) and Content-Aware Sequentialization (CAS), which replace Mamba's rigid 1D scanning with dynamic, content-adaptive spatiotemporal interaction; and (2) a Global-Local State Space Block (GLSSB) that synergistically integrates windowed self-attention with SSM-driven feature propagation. On the REDS dataset, MambaVSR achieves a 0.58 dB PSNR gain over a state-of-the-art Transformer-based method while using 55% fewer parameters, significantly improving both reconstruction accuracy and efficiency for videos with large motions and long temporal extents.
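To make the scanning idea concrete, below is a minimal sketch (our own illustration, not the paper's code) of deriving a content-aware 1D scan order for one frame via spectral clustering over a token-affinity graph. The function name, cluster count, and the dense cosine affinity are assumptions for illustration; the paper builds its connectivity graph with efficient sparse attention instead.

```python
# Hypothetical sketch of content-aware scan ordering (not the authors' code):
# group tokens by spectral clustering over a feature-affinity graph, then read
# them out cluster by cluster so semantically similar pixels become adjacent
# in the 1D sequence fed to the SSM.
import torch
from sklearn.cluster import SpectralClustering

def content_aware_scan_order(feats: torch.Tensor, n_clusters: int = 8) -> torch.Tensor:
    """feats: (N, C) per-pixel features of one frame; returns a permutation of 0..N-1."""
    f = torch.nn.functional.normalize(feats, dim=-1)
    # nonnegative cosine-similarity graph (dense here; sparse attention in the paper)
    affinity = (f @ f.t()).clamp(min=0).cpu().numpy()
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed", assign_labels="kmeans"
    ).fit_predict(affinity)
    labels = torch.as_tensor(labels)
    # concatenate cluster members to form the adaptive scanning sequence
    return torch.cat([(labels == c).nonzero().flatten() for c in range(n_clusters)])

# toy usage: 64 tokens of dimension 32
order = content_aware_scan_order(torch.randn(64, 32), n_clusters=4)
# CAS (simplified): interleaving the tokens of T frames along this shared order
# places non-locally similar content from different frames next to each other,
# e.g. seq = torch.stack([frames[t][order] for t in range(T)], dim=1)
```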
📝 Abstract
Video super-resolution (VSR) faces critical challenges in effectively modeling non-local dependencies across misaligned frames while preserving computational efficiency. Existing VSR methods typically rely on optical flow strategies or Transformer architectures, which struggle with large motion displacements and long video sequences. To address this, we propose MambaVSR, the first state-space model framework for VSR that incorporates an innovative content-aware scanning mechanism. Unlike the rigid 1D sequential processing in conventional vision Mamba methods, our MambaVSR enables dynamic spatiotemporal interactions through the Shared Compass Construction (SCC) and the Content-Aware Sequentialization (CAS). Specifically, the SCC module constructs intra-frame semantic connectivity graphs via efficient sparse attention and generates adaptive spatial scanning sequences through spectral clustering. Building upon SCC, the CAS module effectively aligns and aggregates non-local similar content across multiple frames by interleaving temporal features along the learned spatial order. To bridge global dependencies with local details, the Global-Local State Space Block (GLSSB) synergistically integrates window self-attention operations with SSM-based feature propagation, enabling high-frequency detail recovery under global dependency guidance. Extensive experiments validate MambaVSR's superiority, outperforming a state-of-the-art Transformer-based method by 0.58 dB PSNR on the REDS dataset with 55% fewer parameters.
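The GLSSB pairing of local window attention and global state-space propagation could be sketched roughly as below. The residual wiring, module names, and the use of the public `mamba_ssm` block are assumptions standing in for the paper's actual design; this shows only how the two branches can share one token stream.

```python
# Hedged sketch of a global-local block: window self-attention restores local
# high-frequency detail, while an SSM pass over the content-aware sequence
# propagates global context. Assumes the mamba_ssm package is installed.
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm

class GlobalLocalBlock(nn.Module):
    def __init__(self, dim: int, window: int = 8, heads: int = 4):
        super().__init__()
        self.window = window
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ssm = Mamba(d_model=dim)

    def forward(self, x: torch.Tensor, order: torch.Tensor) -> torch.Tensor:
        """x: (B, N, C) frame tokens; order: (N,) content-aware permutation.
        Assumes N is divisible by the window size."""
        B, N, C = x.shape
        # local branch: self-attention inside non-overlapping windows
        w = self.norm1(x).reshape(B * (N // self.window), self.window, C)
        local, _ = self.attn(w, w, w)
        x = x + local.reshape(B, N, C)
        # global branch: SSM scan along the learned semantic order, then unpermute
        inv = torch.argsort(order)
        x = x + self.ssm(self.norm2(x)[:, order])[:, inv]
        return x
```

A full MambaVSR block would presumably also handle cross-frame interleaving from CAS before the SSM scan; the point of the sketch is the division of labor, with attention confined to cheap local windows and the linear-complexity SSM carrying the long-range dependencies.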