🤖 AI Summary
This study addresses the lack of fair evaluation of visual state-space models (SSMs) in remote-sensing semantic segmentation. By standardizing a four-stage feature interface and fixing a lightweight decoder so that only the encoder varies, the authors run controlled experiments on VMamba, MambaVision, and Spatial-Mamba. Evaluations on LoveDA and ISPRS Potsdam show that visual SSMs achieve a better accuracy–efficiency trade-off than the controlled CNN and Transformer baselines, but that their gains are limited by asymmetric cross-domain generalization, weak boundary delineation under distribution shift, and diminishing returns from intra-family model scaling. Further improvements are therefore more likely to come from robustness-oriented design and boundary-aware decoding than from enlarging encoder capacity alone.
📝 Abstract
Visual state-space models (SSMs) are increasingly promoted as efficient alternatives to Vision Transformers, yet their practical advantages remain unclear under fair comparison because existing studies rarely isolate encoder effects from decoder and training choices. We present a strictly controlled benchmark of representative visual SSM families, including VMamba, MambaVision, and Spatial-Mamba, for remote-sensing semantic segmentation, in which only the encoder varies across experiments. Evaluated on LoveDA and ISPRS Potsdam under a unified four-stage feature interface and a fixed lightweight decoder, the benchmark reveals three main findings: intra-family scaling yields only modest gains, cross-domain generalization is strongly asymmetric, and boundary delineation is the dominant failure mode under distribution shift. Although visual SSMs achieve favorable accuracy-efficiency trade-offs relative to the controlled CNN and Transformer baselines considered here, the results suggest that future improvements are more likely to come from robustness-oriented design and boundary-aware decoding than from encoder scaling alone. By isolating encoder behavior under a unified and reproducible protocol, this study establishes a practical reference benchmark for the design and evaluation of future Mamba-based segmentation backbones.
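The controlled protocol described above (a unified four-stage feature interface, a fixed decoder, and only the encoder varying) can be illustrated with a minimal shape-level sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the stage strides of 4, 8, 16, and 32, the names `EncoderSpec`, `FixedDecoder`, `check_interface`, and `run_controlled_comparison`, and the channel widths and class count used as placeholders. It tracks tensor shapes only and does not implement any of the benchmarked models.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

# Assumed strides for the four stages; the abstract only says "four-stage".
ASSUMED_STRIDES = (4, 8, 16, 32)

Shape = Tuple[int, int, int]  # (channels, height, width)


@dataclass(frozen=True)
class EncoderSpec:
    """Hypothetical encoder description: a name plus channels per stage."""
    name: str
    stage_channels: Tuple[int, int, int, int]

    def feature_shapes(self, height: int, width: int) -> List[Shape]:
        # One feature map per stage, downsampled by the assumed stride.
        return [(c, height // s, width // s)
                for c, s in zip(self.stage_channels, ASSUMED_STRIDES)]


def check_interface(shapes: Sequence[Shape], height: int, width: int) -> None:
    """Raise ValueError if the features do not match the four-stage contract."""
    if len(shapes) != len(ASSUMED_STRIDES):
        raise ValueError(f"expected 4 stages, got {len(shapes)}")
    for (_, h, w), s in zip(shapes, ASSUMED_STRIDES):
        if (h, w) != (height // s, width // s):
            raise ValueError(f"stage with stride {s} has wrong size {(h, w)}")


@dataclass(frozen=True)
class FixedDecoder:
    """Stand-in for a fixed decoder; its settings never change per encoder."""
    num_classes: int = 7  # placeholder value, not from the paper

    def output_shape(self, shapes: Sequence[Shape]) -> Shape:
        # Predict at the resolution of the first (highest-resolution) stage.
        _, h, w = shapes[0]
        return (self.num_classes, h, w)


def run_controlled_comparison(encoders: Sequence[EncoderSpec],
                              decoder: FixedDecoder,
                              height: int, width: int) -> Dict[str, Shape]:
    """Swap encoders while holding the decoder and input size constant."""
    results: Dict[str, Shape] = {}
    for enc in encoders:
        shapes = enc.feature_shapes(height, width)
        check_interface(shapes, height, width)
        results[enc.name] = decoder.output_shape(shapes)
    return results


if __name__ == "__main__":
    # Channel widths below are illustrative, not real model configurations.
    encoders = [EncoderSpec("encoder_a", (64, 128, 256, 512)),
                EncoderSpec("encoder_b", (96, 192, 384, 768))]
    print(run_controlled_comparison(encoders, FixedDecoder(), 512, 512))
```

The point of the design is that the decoder sees the same number of feature maps at the same resolutions regardless of the encoder, so differences in output quality can be attributed to the encoder rather than to decoder or interface changes.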