🤖 AI Summary
This work addresses the limitation of existing visual state space models in explicitly controlling input-dependent memory behavior within compact backbones, where long scanning paths can exceed the effective memory range. The authors propose a Structured Selective State Space Model that, for the first time, incorporates both real and complex-conjugate poles into visual SSMs. By modulating pole radius and angle within bounded ranges, the model generates token-dependent stable poles, enabling an interpretable and adaptive memory mechanism. Combined with grouped pole sharing and a lightweight low-rank input pathway, the architecture maintains linear-complexity scanning. Experiments across image classification, semantic segmentation, and object detection show that the proposed model reduces SSM computational complexity by up to 44% in Vision Mamba-style models while matching or surpassing baseline accuracy.
📝 Abstract
State Space Models (SSMs) have emerged as a compelling alternative to attention models for long-range vision tasks, offering input-dependent recurrence with linear complexity. However, most efficient SSM variants reduce computational cost by modifying scan routes, resolutions, or traversal patterns, while largely leaving the recurrent dynamics implicit. Consequently, the model's state-dependent memory behavior is difficult to control, particularly in compact backbones where long scan paths can exceed the effective memory horizon. We propose Token-Conditioned Poles SSM (TCP-SSM), a structured selective SSM framework that improves efficiency while making recurrence dynamics explicit and interpretable through stable poles. TCP-SSM builds each scan operator from (1) real poles that model monotone or sign-alternating decay, and (2) complex-conjugate poles that capture damped oscillatory responses. Using bounded radius and angle modulation, TCP-SSM converts shared base poles into token-dependent poles, allowing each scan step to adapt its memory behavior to the current visual token while preserving pole stability. For practical scalability, we integrate grouped pole sharing with a lightweight low-rank input pathway, yielding an efficient scan operator that preserves linear-time scan complexity. Across image classification, semantic segmentation, and object detection, TCP-SSM reduces SSM computational complexity by up to 44% in Vision Mamba-style models while maintaining or surpassing baseline accuracy.
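The abstract does not spell out the exact parameterization, so the following is only a minimal sketch of the general idea of bounded radius and angle modulation, not the authors' implementation. It assumes tanh-bounded offsets around a shared base pole and a plain sequential scan. The function names (`bounded_poles`, `scan`), the bound parameters (`max_radius_shift`, `max_angle_shift`, `r_max`), and the factor of two used to combine a conjugate pair are all hypothetical choices made for illustration.

```python
import numpy as np

def bounded_poles(base_radius, base_angle, delta_r, delta_theta,
                  max_radius_shift=0.1, max_angle_shift=0.3, r_max=0.999):
    """Turn unbounded per-token offsets into stable complex poles.

    Assumption: tanh keeps each offset inside a fixed range around the
    shared base pole; the paper may use a different bounding function.
    """
    radius = base_radius + max_radius_shift * np.tanh(delta_r)
    angle = base_angle + max_angle_shift * np.tanh(delta_theta)
    # Clip so every pole stays strictly inside the unit circle (stable).
    radius = np.clip(radius, 0.0, r_max)
    return radius * np.exp(1j * angle)

def scan(x, poles):
    """Sequential O(T) recurrence h_t = p_t * h_{t-1} + x_t.

    Only one pole of each conjugate pair is tracked; the pair's combined
    real-valued response is twice the real part of that state.
    """
    h = 0.0 + 0.0j
    out = np.empty(len(x))
    for t in range(len(x)):
        h = poles[t] * h + x[t]
        out[t] = 2.0 * h.real
    return out

# Impulse response with token-dependent poles (random offsets stand in
# for values that a network would predict from each visual token).
rng = np.random.default_rng(0)
T = 16
impulse = np.zeros(T)
impulse[0] = 1.0
poles = bounded_poles(0.9, 0.5, rng.normal(size=T), rng.normal(size=T))
response = scan(impulse, poles)
```

In this sketch, a base angle of 0 gives monotone decay and a base angle of π gives sign-alternating decay, which correspond to the two real-pole cases, while intermediate angles give a damped oscillation. Because the radius is clipped below 1, every token-dependent pole remains stable regardless of the predicted offsets.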