OCTOPUS: Enhancing the Spatial-Awareness of Vision SSMs with Multi-Dimensional Scans and Traversal Selection

📅 2026-01-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing Vision State Space Models (V-SSMs): because they rely on causal sequential modeling, they struggle to preserve local spatial structure. The authors propose a multi-directional recurrent mechanism that performs multi-dimensional scans with traversal selection across eight directions (forward and backward along the horizontal, vertical, and diagonal axes), balancing global contextual awareness with local spatial consistency while keeping linear computational complexity. The mechanism improves the model's perception of image spatial structure without giving up the efficiency of SSMs. Experiments on image classification and segmentation show improved boundary preservation and region coherence in segmentation, along with higher classification accuracy than existing V-SSM approaches, which the authors take as evidence that the architecture is effective and scalable.
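The locality problem with a single causal scan can be made concrete with a small sketch. It assumes a row-major (raster) flattening of a hypothetical 14x14 patch grid, which is a common convention but is not stated in the source; the helper names are illustrative only. It measures how far apart spatially adjacent patches end up in the 1D sequence.

```python
def raster_index(row, col, width):
    """Position of patch (row, col) in a row-major (raster) flattening."""
    return row * width + col


def sequence_gap(a, b, width):
    """Distance along the 1D sequence between two patches given as (row, col)."""
    return abs(raster_index(a[0], a[1], width) - raster_index(b[0], b[1], width))


# Assumption: a 14x14 grid, e.g. a 224px image cut into 16px patches.
width = 14
center = (5, 5)

right_gap = sequence_gap(center, (5, 6), width)  # horizontal neighbour
below_gap = sequence_gap(center, (6, 5), width)  # vertical neighbour
diag_gap = sequence_gap(center, (6, 6), width)   # diagonal neighbour

print(right_gap, below_gap, diag_gap)
```

Under this flattening only the horizontal neighbour stays adjacent in the sequence. The vertical and diagonal neighbours sit a full row or more away, so a purely causal recurrence must carry their information across many unrelated patches.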

📝 Abstract

State space models (SSMs) have recently emerged as an alternative to transformers due to their ability to model global relationships in text with linear complexity. However, their success in vision tasks has been limited by their causal formulation, which is suitable for sequential text but detrimental in the spatial domain, where causality breaks the inherent spatial relationships among pixels or patches. As a result, standard SSMs fail to capture local spatial coherence, often linking non-adjacent patches while ignoring neighboring ones that are visually correlated. To address these limitations, we introduce OCTOPUS, a novel architecture that preserves both global context and local spatial structure within images while maintaining the linear complexity of SSMs. OCTOPUS performs discrete recurrence along eight principal orientations, going forward or backward in the horizontal, vertical, and diagonal directions, allowing effective information exchange across all spatially connected regions while maintaining independence among unrelated patches. This design enables multi-directional recurrence, capturing both global context and local spatial structure with SSM-level efficiency. In our classification and segmentation benchmarks, OCTOPUS demonstrates notable improvements in boundary preservation and region consistency, as evident from the segmentation results, while achieving better classification accuracy than existing V-SSM-based models. These results suggest that OCTOPUS can serve as a foundation method for multi-directional recurrence, a scalable and effective mechanism for building spatially aware and computationally efficient vision architectures.
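As a rough illustration of what scanning along eight orientations can look like, the sketch below builds eight traversal orders over an H x W patch grid: horizontal, vertical, main-diagonal, and anti-diagonal, each forward and backward. This is one plausible reading of the abstract, not the authors' implementation; the function name, the dictionary keys, and the exact ordering within each diagonal are assumptions made for the example.

```python
import numpy as np


def eight_scan_orders(height, width):
    """Return eight permutations of the flattened patch indices 0..H*W-1.

    Hypothetical helper: each entry lists patch indices (row-major ids)
    in the order a 1D recurrence would visit them.
    """
    idx = np.arange(height * width).reshape(height, width)
    rows, cols = np.indices((height, width))
    rows, cols = rows.ravel(), cols.ravel()

    horizontal = idx.ravel()   # row by row, left to right
    vertical = idx.T.ravel()   # column by column, top to bottom
    # np.lexsort sorts by the LAST key first, so the diagonal id is primary
    # and the row index breaks ties within a diagonal.
    diagonal = np.lexsort((rows, cols - rows))       # lines with constant col - row
    anti_diagonal = np.lexsort((rows, rows + cols))  # lines with constant row + col

    orders = {}
    for name, order in [
        ("horizontal", horizontal),
        ("vertical", vertical),
        ("diagonal", diagonal),
        ("anti_diagonal", anti_diagonal),
    ]:
        orders[name + "_fwd"] = order
        orders[name + "_bwd"] = order[::-1]  # same path, opposite direction
    return orders


orders = eight_scan_orders(3, 3)
for name, order in orders.items():
    print(f"{name:18s} {order.tolist()}")
```

To use one of these orders, a token array of shape `(H*W, C)` would be reindexed as `tokens[order]`, passed through a 1D recurrence, and mapped back to grid order with the inverse permutation `np.argsort(order)`. Each order is a single pass over the patches, so the total cost stays linear in the number of patches.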

Problem

Research questions and friction points this paper is trying to address.

Vision State Space Models

Spatial Awareness

Local Spatial Coherence

Causal Formulation

Image Structure

Innovation

Methods, ideas, or system contributions that make the work stand out.

State Space Models

Spatial Awareness

Multi-Directional Recurrence