🤖 AI Summary
Visual Mamba models lose inherent 2D spatial structure when image patches are serialized, so state space modeling is implicitly biased by the sequence ordering. Method: We propose the first systematic framework for visualizing and quantifying this effect, integrating interpretability tools—including patch-reordering experiments, attention heatmaps, and state trajectory tracking—to analyze how selective scanning operates in vision contexts. Contribution/Results: We empirically demonstrate that state evolution in SSM layers depends strongly on both the patches' original 2D positions and their sequence ordering, contradicting the conventional "sequence-agnostic" modeling assumption. Crucially, we show that selective scanning has a pronounced local spatial preference, and that alternative patch-ordering strategies substantially alter cross-patch attention distributions. These findings provide empirical evidence and methodological foundations for architecture design, diagnostics, and interpretability research in visual Mamba models.
📝 Abstract
State space models (SSMs) have emerged as an efficient alternative to transformer-based models, offering linear complexity in sequence length rather than the quadratic cost of attention. One of the latest advances in SSMs, Mamba, introduces a selective scan mechanism that assigns trainable weights to input tokens, effectively mimicking the attention mechanism. Mamba has also been successfully extended to the vision domain by decomposing 2D images into smaller patches and arranging them as 1D sequences. However, it remains unclear how these patches interact with (or attend to) each other in relation to their original 2D spatial locations. Moreover, the order in which patches are arranged into a sequence significantly impacts their attention distribution. To better understand the attention between patches and explore the attention patterns, we introduce a visual analytics tool specifically designed for vision-based Mamba models. The tool reveals how attention is distributed across patches within individual Mamba blocks and how it evolves through the model. Using the tool, we also investigate the impact of different patch-ordering strategies on the learned attention, offering further insights into the model's behavior.
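To make the patch-ordering question concrete, the sketch below (a minimal NumPy illustration, not the serialization used by any specific Mamba variant) shows how the same 2D patch grid can be flattened into different 1D sequences. Note that patches adjacent in 2D can end up far apart in 1D, which is exactly the effect the ordering experiments probe.

```python
import numpy as np

def patch_orderings(grid_h, grid_w):
    """Three illustrative 1D serializations of a grid_h x grid_w patch grid.

    These orderings (raster, column-wise, snake) are common examples only;
    real vision Mamba models may use other or multiple scan directions.
    """
    idx = np.arange(grid_h * grid_w).reshape(grid_h, grid_w)
    row_major = idx.reshape(-1)        # raster scan: left-to-right, top-to-bottom
    col_major = idx.T.reshape(-1)      # column scan: top-to-bottom, left-to-right
    snake = idx.copy()
    snake[1::2] = snake[1::2, ::-1]    # reverse every other row: 2D neighbors stay adjacent in 1D
    return row_major, col_major, snake.reshape(-1)

rm, cm, sn = patch_orderings(3, 3)
# rm -> [0 1 2 3 4 5 6 7 8]
# cm -> [0 3 6 1 4 7 2 5 8]
# sn -> [0 1 2 5 4 3 6 7 8]
```

Under a raster scan, patches 2 and 3 above are 2D neighbors only diagonally yet adjacent in the sequence, while vertically adjacent patches 0 and 3 are three steps apart; each ordering imposes a different implicit bias on the selective scan's recurrent state.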