🤖 AI Summary
This work addresses a central challenge in image super-resolution: conventional Transformers struggle to balance computational efficiency and global modeling due to the quadratic complexity of self-attention, while window-based attention suffers from limited receptive fields. To overcome this, the authors propose the Selective Aggregation Transformer (SAT), which introduces novel density and isolation metrics to guide efficient compression of key-value tokens while preserving full-resolution queries. By representing each salient region with a single aggregated token, SAT reduces the token count by up to 97% yet retains high-frequency details and expands the effective receptive field. Experiments demonstrate that SAT outperforms the current state-of-the-art method PFT by up to 0.22 dB in PSNR while reducing FLOPs by up to 27%.
📝 Abstract
Transformer-based approaches have revolutionized image super-resolution by modeling long-range dependencies. However, the quadratic computational complexity of vanilla self-attention poses significant challenges, often forcing a compromise between efficiency and global context exploitation. Recent window-based attention methods mitigate this cost by localizing computation, but they yield restricted receptive fields. To overcome these limitations, we propose the Selective Aggregation Transformer (SAT). This novel transformer efficiently captures long-range dependencies and enlarges the model's receptive field by selectively aggregating the key-value matrices via our Density-driven Token Aggregation algorithm (reducing the number of tokens by 97%) while maintaining the full resolution of the query matrix. This design significantly reduces computational cost, enabling scalable global interactions without compromising reconstruction fidelity. SAT identifies clusters and represents each with a single aggregation token, using density and isolation metrics to ensure that critical high-frequency details are preserved. Experimental results demonstrate that SAT outperforms the state-of-the-art method PFT by up to 0.22 dB, while reducing the total number of FLOPs by up to 27%.
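The asymmetry described above — full-resolution queries attending to a heavily compressed set of key-value tokens — can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the `density_indices` helper is a hypothetical stand-in that scores each key token by how many neighbors fall within a fixed radius (a crude proxy for the paper's density/isolation metrics), whereas SAT represents each cluster with an aggregated token.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def density_indices(tokens, m, radius=1.0):
    """Toy density score: count of neighbors within `radius`.

    A stand-in for SAT's density/isolation metrics; returns the
    indices of the m densest tokens. Shape: tokens is (N, d).
    """
    d2 = ((tokens[:, None, :] - tokens[None, :, :]) ** 2).sum(-1)
    density = (d2 < radius ** 2).sum(axis=1)
    return np.argsort(-density)[:m]

def selective_attention(q, k, v, m):
    """Full-resolution queries attend to m aggregated K/V tokens.

    Attention cost shrinks from O(N^2) to O(N * m); with m << N
    this mirrors the ~97% token reduction claimed in the paper.
    """
    idx = density_indices(k, m)          # pick the same m tokens for K and V
    k_sel, v_sel = k[idx], v[idx]        # (m, d) each
    scale = 1.0 / np.sqrt(q.shape[-1])
    attn = softmax(q @ k_sel.T * scale)  # (N, m) instead of (N, N)
    return attn @ v_sel                  # (N, d), one output per query
```

With N = 64 tokens compressed to m = 4 keys/values, the score matrix is 64×4 rather than 64×64, while every query still receives a full-dimensional output.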