🤖 AI Summary
This work addresses the limitations of conventional RWKV in remote sensing image fusion (pan-sharpening), which relies on semantically agnostic raster scanning and thus struggles to model multi-granularity semantic structures while being susceptible to positional bias. To overcome these issues, the authors propose a multi-granularity semantic prototype scanning paradigm that integrates a high-order RWKV architecture with a tri-token prompt learning framework. Specifically, semantic prototype tokens are generated via clustering guided by locality-sensitive hashing, and a three-component prompting mechanism (global, prototype, and register tokens) is introduced. Additionally, center difference convolution on the value pathway injects high-frequency detail, and an invertible multi-scale Q-shift operation provides lossless feature transformation without parameter-heavy receptive field expansion. The proposed method achieves significant performance gains over state-of-the-art approaches across multiple remote sensing benchmarks, simultaneously improving spatial resolution and spectral fidelity.
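The hashing-guided grouping and token reordering described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes random-hyperplane (SimHash-style) locality-sensitive hashing, takes the mean of each bucket as its prototype, and uses a hypothetical function name `lsh_scan_order` with hypothetical parameters `n_bits` and `seed`.

```python
import numpy as np

def lsh_scan_order(tokens, n_bits=4, seed=0):
    """Group tokens with random-hyperplane LSH and derive a scan order.

    tokens: array of shape (N, C), one feature vector per spatial position.
    Returns (order, buckets, prototypes). All names here are illustrative.
    """
    rng = np.random.default_rng(seed)
    # Assumption: sign patterns against random hyperplanes act as the hash.
    planes = rng.standard_normal((tokens.shape[1], n_bits))
    bits = (tokens @ planes > 0).astype(np.int64)        # (N, n_bits)
    buckets = bits @ (1 << np.arange(n_bits))            # integer bucket id per token
    # A stable sort keeps raster order inside each bucket.
    order = np.argsort(buckets, kind="stable")
    # One prototype per non-empty bucket: the mean of its member tokens.
    ids = np.unique(buckets)
    prototypes = np.stack([tokens[buckets == b].mean(axis=0) for b in ids])
    return order, buckets, prototypes

tokens = np.random.default_rng(1).standard_normal((64, 8))
order, buckets, prototypes = lsh_scan_order(tokens, n_bits=3)
reordered = tokens[order]                 # tokens with equal hashes are now adjacent
restored = reordered[np.argsort(order)]   # inverse permutation recovers raster order
```

Under this sketch, running the function with several values of `n_bits` would give coarser or finer groupings, which is one plausible reading of "multi-granularity"; the paper may construct its grains differently.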
📝 Abstract
In this work, we propose a Multigrain-aware Semantic Prototype Scanning paradigm for pan-sharpening, built upon a high-order RWKV architecture and a tri-token prompting mechanism derived from semantic clustering. Specifically, our method contains three key components: 1) Multigrain-aware Semantic Prototype Scanning. Although RWKV offers an efficient linear-complexity alternative to Transformers, its conventional bidirectional raster scanning is still semantic-agnostic and prone to positional bias. To address this issue, we introduce a semantic-driven scanning strategy that leverages locality-sensitive hashing to group semantically related regions and construct multi-grain semantic prototypes, enabling context-aware token reordering and more coherent global interaction. 2) Tri-token Prompt Learning. We design a tri-token prompting mechanism consisting of a global token, cluster-derived prototype tokens, and a learnable register token. The global and prototype tokens provide complementary semantic priors for RWKV modeling, while the register token helps suppress noisy and artifact-prone intermediate representations. 3) Invertible Q-Shift. To counteract the loss of spatial details, we apply center difference convolution on the value pathway to inject high-frequency information, and introduce an invertible multi-scale Q-shift operation for efficient and lossless feature transformation without parameter-heavy receptive field expansion. Experimental results demonstrate the superiority of our method.
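To make the center difference convolution in component 3 concrete, the following is a minimal single-channel NumPy sketch. It assumes the formulation commonly used in the literature, y = conv(x, w) - theta * sum(w) * x; the abstract does not state which variant the authors use, and the helper names `conv2d_same` and `central_difference_conv` and the parameter `theta` are illustrative.

```python
import numpy as np

def conv2d_same(x, w):
    """Plain 3x3 cross-correlation with zero padding. x: (H, W), w: (3, 3)."""
    H, W = x.shape
    xp = np.pad(x, 1)                       # zero-pad one pixel on every side
    out = np.zeros((H, W), dtype=float)
    for i in range(3):
        for j in range(3):
            out += w[i, j] * xp[i:i + H, j:j + W]
    return out

def central_difference_conv(x, w, theta=0.7):
    """Assumed CDC form: vanilla convolution minus a scaled center term.

    theta = 0 gives the ordinary convolution; theta = 1 makes the kernel
    respond only to differences between each neighbor and the center pixel.
    """
    return conv2d_same(x, w) - theta * w.sum() * x

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 3))
flat = np.full((6, 6), 3.0)
response = central_difference_conv(flat, w, theta=1.0)
# Away from the zero-padded border, a constant image has no local differences,
# so response[1:-1, 1:-1] is zero up to floating-point error.
```

With `theta=1.0` the operator suppresses flat regions and keeps only local intensity differences, which is why this family of operators is used to emphasize high-frequency detail such as edges.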