🤖 AI Summary
Traditional recurrent video super-resolution (VSR) methods suffer from gradient vanishing and poor parallelism, while causal Mamba-based models are inherently limited in modeling fine-grained spatial dependencies. To address these issues, we propose an efficient hybrid spatiotemporal modeling architecture. Our approach features: (1) a Gather-Scatter Mamba mechanism that aligns neighboring frame features to the central frame within a temporal window before aggregation and scattering, mitigating occlusion artifacts and enhancing feature redistribution; and (2) integration of shifted-window self-attention to explicitly capture local spatial dependencies, compensating for Mamba’s structural constraints. The architecture retains linear time complexity while enabling precise spatiotemporal feature propagation. Extensive experiments demonstrate state-of-the-art performance on multiple VSR benchmarks, along with significantly accelerated inference—effectively balancing accuracy and efficiency.
📝 Abstract
State Space Models (SSMs), most notably RNNs, have historically played a central role in sequential modeling. Although attention mechanisms such as Transformers have since dominated due to their ability to model global context, their quadratic complexity and limited scalability make them less suited for long sequences. Video super-resolution (VSR) methods have traditionally relied on recurrent architectures to propagate features across frames. However, such approaches suffer from well-known issues including vanishing gradients, lack of parallelism, and slow inference speed. Recent advances in selective SSMs like Mamba offer a compelling alternative: by enabling input-dependent state transitions with linear-time complexity, Mamba mitigates these issues while maintaining strong long-range modeling capabilities. Despite this potential, Mamba alone struggles to capture fine-grained spatial dependencies due to its causal nature and lack of explicit context aggregation. To address this, we propose a hybrid architecture that combines shifted window self-attention for spatial context aggregation with Mamba-based selective scanning for efficient temporal propagation. Furthermore, we introduce Gather-Scatter Mamba (GSM), an alignment-aware mechanism that warps features toward a center anchor frame within the temporal window before Mamba propagation and scatters them back afterward, effectively reducing occlusion artifacts and ensuring effective redistribution of aggregated information across all frames. The official implementation is provided at: https://github.com/Ko-Lani/GSMamba.
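The gather-propagate-scatter pattern described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: a simple integer `np.roll` shift stands in for optical-flow warping, and the `propagate` callable stands in for the Mamba selective scan. The function and argument names (`warp`, `gather_scatter`, `flows`) are illustrative, not taken from the GSMamba codebase.

```python
import numpy as np

def warp(feat, flow):
    """Stand-in for flow-based warping: shift a (H, W, C) feature map
    by an integer displacement along the width axis."""
    return np.roll(feat, shift=flow, axis=1)

def gather_scatter(frames, flows, propagate):
    """Sketch of the Gather-Scatter pattern.

    frames:    list of T feature maps, each (H, W, C)
    flows:     list of T integer displacements of each frame
               relative to the center anchor frame
    propagate: callable on the aligned (T, H, W, C) stack,
               standing in for the Mamba temporal scan
    """
    # Gather: align every frame's features to the center anchor.
    aligned = np.stack([warp(f, fl) for f, fl in zip(frames, flows)])
    # Temporal propagation happens in the aligned coordinate frame,
    # so corresponding pixels line up across time.
    out = propagate(aligned)
    # Scatter: warp the propagated features back to each frame's
    # own coordinates by applying the inverse displacement.
    return [warp(o, -fl) for o, fl in zip(out, flows)]

# Tiny demo: with identity propagation, gather followed by scatter
# returns each frame unchanged (the warps are mutually inverse).
frames = [np.arange(12, dtype=float).reshape(1, 4, 3) + t for t in range(3)]
flows = [-1, 0, 1]
restored = gather_scatter(frames, flows, lambda s: s)
```

The design point the sketch captures is that alignment is applied symmetrically: whatever warp gathers a neighbor onto the anchor is inverted when scattering the aggregated features back, so every frame in the window receives propagated information in its own coordinates.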