🤖 AI Summary
This work addresses a limitation of existing visual state space models (SSMs): they rely on fixed scanning orders that disrupt spatial adjacency and object continuity, which degrades performance under geometric transformations such as rotation. The authors analyze how scanning order affects visual SSMs and propose PRISMamba, which partitions the image into concentric rings, applies order-agnostic aggregation within each ring, and propagates context across rings through a set of short radial SSMs. A partial channel filtering mechanism improves efficiency by routing only the most informative channels through the recurrent ring pathway and keeping the rest on a lightweight residual branch. The model achieves 84.5% Top-1 accuracy on ImageNet-1K with 3.9G FLOPs and a throughput of 3,054 images per second on an A100 GPU, outperforming VMamba in both accuracy and throughput with fewer FLOPs. It also maintains accuracy under rotation, whereas fixed-path scans drop by 1–2%, giving a strong balance among accuracy, efficiency, and geometric robustness.
📝 Abstract
State Space Models (SSMs) have emerged as efficient alternatives to attention for vision tasks, offering linear-time sequence processing with competitive accuracy. Vision SSMs, however, require serializing 2D images into 1D token sequences along a predefined scan order, a factor that is often overlooked. We show that scan order critically affects performance by altering spatial adjacency, fracturing object continuity, and amplifying degradation under geometric transformations such as rotation. We present Partial RIng Scan Mamba (PRISMamba), a rotation-robust traversal that partitions an image into concentric rings, performs order-agnostic aggregation within each ring, and propagates context across rings through a set of short radial SSMs. Efficiency is further improved via partial channel filtering, which routes only the most informative channels through the recurrent ring pathway while keeping the rest on a lightweight residual branch. On ImageNet-1K, PRISMamba achieves 84.5% Top-1 accuracy with 3.9G FLOPs and 3,054 img/s on an A100 GPU, outperforming VMamba in both accuracy and throughput while requiring fewer FLOPs. It also maintains performance under rotation, whereas fixed-path scans drop by 1–2%. These results highlight scan-order design, together with channel filtering, as a crucial, underexplored factor for accuracy, efficiency, and rotation robustness in Vision SSMs. Code will be released upon acceptance.
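The ring-scan idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation, which has not been released. It assumes square concentric rings indexed by distance to the nearest border, mean pooling as the order-agnostic aggregation, and a toy scalar-decay linear recurrence standing in for the short radial SSMs. The names `ring_index`, `ring_features`, `radial_scan`, and the `decay` parameter are hypothetical.

```python
import numpy as np


def ring_index(height, width):
    """Assign each position to a concentric square ring (0 = outermost border).

    Assumption: rings are defined by distance to the nearest border; the
    abstract does not specify the ring geometry.
    """
    rows = np.arange(height)[:, None]
    cols = np.arange(width)[None, :]
    return np.minimum(
        np.minimum(rows, height - 1 - rows),
        np.minimum(cols, width - 1 - cols),
    )


def ring_features(feature_map):
    """Order-agnostic aggregation: mean-pool tokens per ring.

    feature_map has shape (H, W, C); the result has shape (num_rings, C).
    Mean pooling is an assumed choice; any permutation-invariant
    reduction over a ring would also be order-agnostic.
    """
    height, width, _ = feature_map.shape
    rings = ring_index(height, width)
    num_rings = int(rings.max()) + 1
    return np.stack(
        [feature_map[rings == r].mean(axis=0) for r in range(num_rings)]
    )


def radial_scan(ring_feats, decay=0.9):
    """Toy linear recurrence across rings, from the outermost ring inward.

    Stand-in for a radial SSM: state_k = decay * state_{k-1} + (1 - decay) * x_k.
    """
    state = np.zeros(ring_feats.shape[1])
    outputs = []
    for x in ring_feats:
        state = decay * state + (1.0 - decay) * x
        outputs.append(state.copy())
    return np.stack(outputs)
```

In this toy setting, rotating a square feature map by a multiple of 90 degrees (for example with `np.rot90(x, k=1, axes=(0, 1))`) maps each ring onto itself, so `radial_scan(ring_features(x))` is unchanged up to floating-point rounding, whereas a raster-order flattening of the same map would produce a different token sequence. Behavior under arbitrary rotation angles depends on the ring geometry and interpolation, which this sketch does not model.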