Can Graphs Help Vision SSMs See Better?

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing Vision State Space Models (Vision SSMs) flatten images into one-dimensional token sequences without explicitly modeling local semantic neighborhoods, which limits their representational capacity. This work proposes GraphScan, a dynamic scanning operator that integrates graph structure into the Vision SSM scanning process. For each token, GraphScan constructs a local graph, learns feature-conditioned affinities, and performs a single step of message passing, so that tokens are aggregated over their semantic neighborhood before a selective SSM handles global modeling. This turns geometric serialization into semantic routing and yields interpretable displacement fields, while preserving the token count and linear complexity in image size. With GraphScan, GraphScan-Mamba achieves state-of-the-art performance among Vision SSMs on image classification, object detection, instance segmentation, and semantic segmentation, with modest computational overhead.

📝 Abstract

Vision state space models inherit the efficiency and long-range modeling ability of Mamba-style selective scans. However, their performance depends critically on the representation of two-dimensional visual features as one-dimensional token sequences. Existing scan operators range from predefined geometric traversals to dynamic coordinate-based samplers that reroute tokens through predicted offsets and interpolation. While effective, these mechanisms primarily adapt paths or sampling locations, rather than explicitly modeling which local patches should exchange information before global state-space mixing. This motivates a simple question: can graphs help vision state space models see better? We introduce GraphScan, a graph-induced dynamic scanning operator for Vision SSMs. For each token, GraphScan constructs a spatially bounded local graph, learns feature-conditioned affinities with relative positional bias, and produces the output token by one-step message passing over its semantic neighborhood. The resulting tokens are locally grounded before being processed by the selective SSM for global aggregation. GraphScan preserves token count and linear scaling in image size, while replacing coordinate-conditioned interpolation with feature-conditioned semantic routing. Integrated into a hierarchical backbone, GraphScan-Mamba achieves state-of-the-art performance among Vision SSMs across image classification, object detection, instance segmentation, and semantic segmentation, with modest computational overhead. Our analysis further shows that GraphScan induces interpretable displacement fields over the token lattice, providing a semantic and spatially grounded view of dynamic scanning. These results suggest that future Vision SSMs should treat scanning not merely as geometric serialization, but as learned local semantic routing before global state-space modeling.
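
The abstract gives no equations, so the following is only a rough NumPy sketch of what a graph-induced scan of this kind might look like, not the authors' implementation. It assumes a square window of a given radius as the "spatially bounded local graph", scaled dot-product affinities with a softmax over neighbors, and one learnable bias per relative offset. The function name `graphscan_sketch`, the projections `w_q` and `w_k`, and the `pos_bias` vector are all hypothetical, and the selective SSM that would consume the output is omitted.

```python
import numpy as np

def graphscan_sketch(x, radius=1, seed=0):
    """Illustrative one-step message passing over a local window.

    x: (H, W, C) token lattice. Returns (out, disp) where out has the same
    shape as x and disp is the (H, W, 2) affinity-weighted mean offset.
    All parameters here are random stand-ins for learned weights.
    """
    H, W, C = x.shape
    rng = np.random.default_rng(seed)
    w_q = rng.standard_normal((C, C)) / np.sqrt(C)  # hypothetical query projection
    w_k = rng.standard_normal((C, C)) / np.sqrt(C)  # hypothetical key projection
    offsets = [(dy, dx) for dy in range(-radius, radius + 1)
                        for dx in range(-radius, radius + 1)]
    pos_bias = np.zeros(len(offsets))               # assumed per-offset positional bias

    q, k = x @ w_q, x @ w_k
    logits = np.full((H, W, len(offsets)), -np.inf)  # -inf masks out-of-bounds neighbors
    neigh = np.zeros((H, W, len(offsets), C))
    for n, (dy, dx) in enumerate(offsets):
        # Tokens (i, j) whose neighbor (i + dy, j + dx) lies inside the lattice.
        i0, i1 = max(0, -dy), min(H, H - dy)
        j0, j1 = max(0, -dx), min(W, W - dx)
        dst = (slice(i0, i1), slice(j0, j1))
        src = (slice(i0 + dy, i1 + dy), slice(j0 + dx, j1 + dx))
        logits[dst + (n,)] = (q[dst] * k[src]).sum(-1) / np.sqrt(C) + pos_bias[n]
        neigh[dst + (n,)] = x[src]

    # Softmax over each token's neighborhood (the center offset is always valid).
    logits -= logits.max(axis=-1, keepdims=True)
    a = np.exp(logits)
    a /= a.sum(axis=-1, keepdims=True)

    out = (a[..., None] * neigh).sum(axis=2)        # one step of message passing
    disp = a @ np.array(offsets, dtype=float)       # expected (dy, dx) per token
    return out, disp
```

Under these assumptions the output keeps the input's token count, and the cost grows as H·W·(2·radius+1)²·C, which is linear in image size for a fixed window. Flattening `out` with `out.reshape(H * W, C)` would give the sequence passed to the global scan, and `disp` is one plausible way to read a displacement field off the learned affinities.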

Problem

Research questions and friction points this paper is trying to address.

Vision State Space Models

Dynamic Scanning

Local Semantic Routing

Token Sequence Representation

Graph-based Modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

GraphScan

Vision State Space Models

Semantic Routing